DBMS2

Choices in data management and analysis

Some things I think about politics

Wed, 2018-02-07 20:04

When one tries to think comprehensively about politics these days, it quickly gets overwhelming. But I think I’ve got some pieces of the puzzle figured out. Here they are in extremely summarized form. I’ll flesh them out later as seems to make sense.

1. Most of what people are saying about modern tribalism is correct. But partisanship is not as absolute as some fear. In particular:

2. The threat from Trump and his Republican enablers is indeed as bad as people fear. He poses a major danger of doing terrible, irreversible harm to the US and the rest of the world. To date the irreversible damage hasn’t been all that terrible, but if Trump and his enablers are given enough time, the oldest modern democracy will be no more.

All common interests notwithstanding, beating Trump’s supporters at the polls is of paramount importance.

3. I agree with those who claim that many of our problems stem from the shredding of trust. But few people seem to realize just how many different aspects of “trust” there are, nor how many degrees there can be of trustworthiness. It’s not just a binary choice between “honest servant of the people” and “lying, cheating crook”.

These observations have strong analogies in IT. What does it mean for a system to be “reliable” or to produce “accurate” results? There are many possible answers, each reasonable in different contexts.

4. I also agree with the view that much of what’s going on relates to a lack of empathy. But it’s not quite as simple as saying that liberals/Democrats/globalists have more empathy, while conservatives/Republicans/populists/nationalists have less. Populists and white nationalists likely have more empathy than I do for certain segments of the population, and anti-abortion zealots surely outshine me in empathy for fetal tissue.

Some say our troubles are due to a deliberate war on truth and democracy. Some say they’re just consequences of broad, long-running trends. I think both views are partially correct.

I’ll make a short post on that point soon, and slightly edit this one accordingly when I do.

5. Much is made these days of people’s penchant for irrationality, which generally takes the following forms:

  • Irrational choices as to which factual claims to accept.
  • Irrational conclusions in light of their chosen “facts”.

I think that a lot of this irrationality can be explained as people not taking the trouble to gain all the facts, to think things through, etc. Indeed, perfect rationality takes so much effort that it would be … well, that it would be a highly irrational choice. So if we want people to be more rational, perhaps we should make it easier for them to be so.

That challenge has many different facets. I hope to have something useful to say about it later on.

6. Outright changing somebody’s mind is very, very hard. But making them less sure of their opinion? That’s a lot easier. Making them more sure of it? That’s a reasonable goal as well.

This too will be spelled out in a future post.

7. Much of the prevailing irrationality can be modeled by describing which contradictions/doublethink people accept, and in which cases they think a contradiction actually proves that something is untrue. And people’s views are sometimes actually influenced by a pull to be more consistent. Real-life examples include:

  • Some traditionally “law and order”/authority-following right-wingers who believe the current allegations about the “Deep State” are more open to doubting FBI claims in general.
  • Similarly, the recent FISA legislation needed bipartisan support to pass, because some generally government-skeptical Republicans were in particular skeptical of the alleged national-security reasons for domestic snooping.
  • States-rights supporters (who usually are conservatives) can extend that to disapproving of Federal marijuana laws and crackdowns.

Politics can be overwhelming

Wed, 2018-02-07 20:03

Like many people, I’ve been shocked and saddened by recent political developments. What I’ve done about it includes (but is not limited to):

  • Vented, ranted and so on. That’s somewhat therapeutic, and also let me engage the other side and try to understand a little better how they think.
  • Tried to understand what’s happening. I probably have had more available time to do that than most people. I also have a variety of relevant experiences to bring to bear.
  • Neglected my work somewhat while doing all that. This neglect has now stopped. After all, the future is quite uncertain, so we should probably work hard in the present while business is still good.
  • Written up some of what I’ve figured out. Of course. That’s what I do. But it’s only “some”, because … well, the entirety of politics is overwhelming.
  • Tried to find specific, actionable ways to help. Stay tuned for more on that part.

As for those writings:

  • I just posted a very high-level overview of modern political complexities. Please read it.
  • I’m working on posts drilling down on various parts of that. Closest to readiness are ones on “Modifying beliefs” (which will include some technology marketing advice) and “The war on truth and democracy” (which will argue that part — and only part — of what’s going on is properly described by the “war” metaphor).
  • I recently posted that the tech industry is under broad political attack. That’s even more true than I realized. Two recent and indicative developments are:
    • Roger McNamee et al. have started an organization to combat the addictive evils they perceive the tech/internet industry as doing.
    • George Soros — whose organization was once my best-paying investment client — thundered at Davos that the tech/internet industry should and will be brought down by antitrust regulators.
  • I also posted recently about the chaotic politics of privacy. If anything, the ongoing FBI/FISA firestorm suggests that I understated the matter.

The chaotic politics of privacy

Mon, 2018-01-22 09:23

Almost nobody pays attention to the real issues in privacy and surveillance. That’s gotten only slightly better over the decade that I’ve written about the subject. But the problems with privacy/surveillance politics run yet deeper than that.

Worldwide

The politics of privacy and surveillance are confused, in many countries around the world. This is hardly surprising. After all:

  • Privacy involves complex technological issues. Few governments understand those well.
  • Privacy also involves complex business issues. Few governments understand those well either.
  • Citizen understanding of these issues is no better.

Technical cluelessness isn’t the only problem. Privacy issues are commonly framed in terms of civil liberties, national security, law enforcement and/or general national sovereignty. And these categories are inherently confusing, in that:

  • Opinions about them often cross standard partisan lines.
  • Different countries take very different approaches, especially in the “civil liberties” area.
  • These categories are rife with questionably-founded fears, such as supposed threats from terrorism, child pornographers, or “foreign interference”.

Data sovereignty regulations — which are quite a big part of privacy law — get their own extra bit of confusion, because of the various purposes they can serve. Chief among these are: 

  • Preventing foreign governments or businesses from impinging on citizens’ privacy.
  • Helping their own governments impinge on citizens’ privacy.
  • Providing a pretext to favor local companies at the expense of foreign ones.

The United States

Specifically in the United States, I’d like to drill into two areas:

  • An important bit of constitutional confusion.
  • Just how bipartisan this all gets in our generally hyper-partisan times.

The constitutional confusion goes something like this:

  • A new communication technology is invented, such as telephones or email.
  • The courts rule that there is no Fourth Amendment expectation of privacy in using such optional services, because:
    • Given how the technology works, the information is temporarily under a third party’s control.
    • If you weren’t willing to give up your privacy, you wouldn’t have used the technology in the first place.
  • Later the technology becomes so central to everyday life that courts start finding the previous reasoning to be inaccurate, and extend the Fourth’s protection of your “papers and effects” to the new communication medium.
  • In the meantime, laws are passed regulating privacy for that particular medium.

For example, email went through exactly this cycle, via United States v. Warshak and the Stored Communications Act (SCA).

Those links are all to Wikipedia. At the time of this writing, the ones on Warshak and the SCA go into considerable constitutional depth.

The Email Privacy Act is also the single best example of this post’s premises about the general chaos of privacy politics.

  • It passed the House of Representatives unanimously in 2016 — 419-0 — which is an honor usually reserved for such noncontroversial subjects as renaming post offices.
  • Even so, it was shot down in the Senate amid opposition from Senators of both parties,* never coming up for a vote.
  • It was passed by voice vote in the House again in 2017.
  • It again didn’t come up for vote in the Senate.

Last week’s FISA reauthorization is another example; it wouldn’t have passed without senior-level Democratic support in the House and Senate alike.

*A chief opponent among the Democrats was Dianne Feinstein, who despite representing California is commonly hostile to technological good sense. She voted for FISA reauthorization as well.

Like many folks, I’ve been distracted by all the other political calamities that have befallen us since November, 2016. But the time to refocus on privacy/surveillance is drawing near.

Related links

  • I wrote about similar subjects in May, 2016, and offered many links then.

The technology industry is under broad political attack

Fri, 2017-12-15 03:25

I apologize for posting a December downer, but this needs to be said.

The technology industry is under attack:

  • From politicians and political pundits …
  • … especially from “populists” and/or the political right …
  • … in the United States and other countries.

These attacks:

  • Are in some cases specific to internet companies such as Google and Facebook.
  • In some cases threaten the tech industry more broadly.
  • Are in some cases part of general attacks on the educated/professional/“globalist”/“coastal” “elites”.

You’ve surely noticed some of these attacks. But you may not have noticed just how many different attacks and criticisms there are, on multiple levels.

1. Concerns about jobs, disruption, gentrification and so on are a Really Big Deal, causing large swaths of the population to regard technology as bad for their pocketbooks. In particular:

  • There’s tremendous concern about job loss to automation and/or globalization. Technology helps cause the first and enable the second.
  • Generally, when an industry destroys jobs, one hopes that it will create new ones to take their place. But while US technology companies have created many jobs, a lot of those are overseas.
  • Flaps about overseas finances, taxes, and so on aren’t helping. Apple, for example, has major issues in Europe and the US alike.
  • Working-class jobs that tech companies do create are often attacked for their pay and conditions, e.g. for Amazon warehouse workers or Uber drivers.
  • Even when the technology industry unquestionably creates good, domestic jobs, the industry may be attacked for them. Consider for example the concerns about cost of living/gentrification in Northern California.
  • “Sharing economy” companies such as Uber and Airbnb are involved in local political fights all around the world, as they undercut traditional service providers.

People who believe that technologists harm them are a major political force.

2. The technology industry is under considerable legislative, regulatory, and judicial pressure. For starters:

  • Tech companies are attacked for doing too little to aid law enforcement and government surveillance.
  • Tech companies are attacked for doing too much to aid law enforcement and government surveillance.
  • Tech companies are attacked for doing too little censorship.
  • Tech companies are attacked for doing too much censorship.
  • Privacy regulations are ever-changing.

Complicating things further, these challenges take different forms in different countries around the world.

Also:

  • China pressures foreign vendors to transfer technology into China.
  • Recent network neutrality developments in the US favor older telecom providers, at the expense of newer internet companies.
  • Anti-immigrant policies in the US threaten tech vendors.

I could keep going much longer than that. Government relations are a major, major issue for tech.

3. It is traditional to claim that advances in communication/media technologies will wreck society.

  • Television was going to make us mass-conformist couch potatoes.
  • Video games were going to make us violent couch potatoes.

This era brings similar concerns.

  • Social media makes us couch potatoes sitting in niche-conformist echo chambers.
  • Modern media over-stimulate us and wreck our attention spans.

I.e., the apocalypse is imminent, and tech is what will bring it on.

The most compelling version of that argument I’ve seen is Jean Twenge’s claims that there’s a teen mental health crisis perfectly matched in time to the rise of the smartphone. And to make any such claim seem particularly damning, please recall: Social media and gaming companies are clearly trying to foster a form of addiction in — well, in their users.

Current concerns may ebb just as previous generations’ did. But for now, they’re yet another aspect of a threat-filled environment.

4. What worries me most is this: The United States and other countries face relentless attacks on education, educators, science, scientists, and rationality itself. And there are no obvious limits to how bad these can get. China’s Cultural Revolution and the Cambodian genocide happened during my lifetime. Stalin and Hitler ruled during my parents’. All four took particular aim at people like us.

Bottom line: EVERYBODY in the technology industry should be or quickly become politically aware. We have an awful lot of politics to deal with.


Notes on artificial intelligence, December 2017

Tue, 2017-12-12 12:53

Most of my comments about artificial intelligence in December, 2015 still hold true. But there are a few points I’d like to add, reiterate or amplify.

1. As I wrote back then in a post about the connection between machine learning and the rest of AI,

It is my opinion that most things called “intelligence” — natural and artificial alike — have a great deal to do with pattern recognition and response.

2. Accordingly, it can be reasonable to equate machine learning and AI.

  • AI based on machine learning frequently works, on more than a toy level. (Examples: Various projects by Google)
  • AI based on knowledge representation usually doesn’t. (Examples: IBM Watson, 1980s expert systems)
  • “AI” can be the sexier marketing or fund-raising term.

3. Similarly, it can be reasonable to equate AI and pattern recognition. Glitzy applications of AI include:

  • Understanding or translation of language (written or spoken as the case may be).
  • Machine vision or autonomous vehicles.
  • Facial recognition.
  • Disease diagnosis via radiology interpretation.

4. The importance of AI and of recent AI advances differs greatly according to application or data category. 

  • Machine learning and AI have little relevance to most traditional transactional apps.
  • Predictive modeling is a huge deal in customer-relationship apps. The most advanced organizations developing and using those rely on machine learning. I don’t see an important distinction between machine learning and “artificial intelligence” in this area.
  • Voice interaction is already revolutionary in certain niches (e.g. smartphones — Siri et al.). The same will likely hold for other natural language or virtual/augmented reality interfaces if and when they go more mainstream. AI seems likely to make a huge impact on user interfaces.
  • AI also seems likely to have huge impact upon the understanding and reduction of machine-generated data.

5. Right now it seems as if large companies are the runaway leaders in AI commercialization. There are several reasons to think that could last.

  • They have deep pockets. Yes, but the same is true in any other area of technology. Small companies commonly out-innovate large ones even so.
  • They have access to lots of data for model training. I find this argument persuasive in some specific areas, most notably any kind of language recognition that can be informed by search engine uses.
  • AI technology is sometimes part of a much larger whole. That argument is not obviously persuasive. After all, software can often be developed by one company and included as a module in somebody else’s systems. Machine vision has worked that way for decades.

I’m sure there are many niches in which decision-making, decision implementation and feedback are so tightly integrated that they all need to be developed by the same organization. But every example that remotely comes to mind is indeed the kind of niche that smaller companies are commonly able to address.

6. China and Russia are both vowing to lead the world in artificial intelligence. From a privacy/surveillance standpoint, this is worrisome. China also has a reasonable path to doing so (Russia not so much), in line with the “Lots of data makes models strong” line of argument.

The fiasco of Japan’s 1980s “Fifth-Generation Computing” initiative is only partly reassuring.

7. It seems that “deep learning” and GPUs fit well for AI/machine learning uses. I see no natural barriers to that trend, assuming it holds up on its own merits.

  • Since silicon clock speeds stopped increasing, chip power improvements have mainly taken the form of increased on-chip parallelism.
  • The general move to the cloud is also not a barrier. I have little doubt major cloud providers could do a good job of providing GPU-based capacity, given that:
    • They build their own computer systems.
    • They showed similar flexibility when they adopted flash storage.
    • Several of them are AI research leaders themselves.

Maybe CPU vendors will co-opt GPU functionality. Maybe not. I haven’t looked into that issue. But either way, it should be OK to adopt software that calls for GPU-style parallel computation.
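By way of illustration — and this is my own example, using PyTorch simply because it’s a well-known way to request GPU-style parallelism — such software typically degrades gracefully when no GPU is present:

```python
# Minimal sketch: the same tensor math runs on a GPU if one is present,
# and on the CPU otherwise. Framework choice (PyTorch) is illustrative.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A toy "model scoring" computation: a batch of inputs times a weight matrix.
x = torch.randn(10_000, 512, device=device)   # 10,000 input vectors
w = torch.randn(512, 128, device=device)      # one dense layer's weights
scores = x @ w                                # runs in parallel on whatever device we got

print(scores.shape, "computed on", device)
```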

8. Computer chess is in the news, so of course I have to comment. The core claim is something like:

  • Google’s AlphaZero technology was trained for four hours playing against itself, with no human heuristic input.
  • It then decisively beat Stockfish, previously the strongest computer chess program in the world.

My thoughts on that start:

  • AlphaZero actually beat a very crippled version of Stockfish.
  • That’s still impressive.
  • Google only released a small fraction of the games. But in the ones it did release, about half had a common theme — AlphaZero seemed to place great value on what chess analysts call “space”.
  • This all fits my view that recent splashy AI accomplishments are focused on pattern recognition.

Imanis Data

Tue, 2017-08-22 07:46

I talked recently with the folks at Imanis Data. For starters:

  • The point of Imanis is to make copies of your databases, for purposes such as backup/restore, test/analysis, or compliance-driven archiving. (That’s in declining order of current customer activity.) Another use is migration via restoring to a different cluster than the one that created the data in the first place.
  • The data can come from NoSQL database managers, from Hadoop, or from Vertica. (Again, that’s in declining order.)
  • As you might imagine, Imanis makes incremental backups; the only full backup is the first one you do for that database.
  • “Imanis” is a new name; the previous name was “Talena”.

Also:

  • Imanis has ~35 subscription customers, a significant majority of which are in the Fortune 1000.
  • Customer industries, in roughly declining order, include:
    • Financial services other than insurance.
    • Insurance.
    • Retail.
    • “Technology”.
  • ~40% of Imanis customers are in the public cloud.
  • Imanis is focused on the North American market at this time.
  • Imanis has ~45 employees.
  • The Imanis product just hit Version 3.

Imanis correctly observes that there are multiple reasons you might want to recover from backup, including:

  • General disaster/system failure.
  • Bug in an application that writes data.
  • Malicious acts, including encryption-by-ransomware.

Imanis uses the phrase “point-in-time backup” to emphasize its flexibility in letting you choose your favorite time-version of your rolling backup.

Imanis also correctly draws the inference that the right backup strategy is some version of:

  • Make backups very frequently. This boils down to “Do a great job of making incremental backups (and restoring from them when necessary).” This is where Imanis has spent the bulk of its technical effort to date.
  • In case recovery is needed, identify the last clean (or provably/confidently clean) version of the database and restore from that. The identification part boils down to letting the backup databases be queried directly. That’s largely a roadmap item.
    • Imanis has recently added the capability to build its own functionality querying the backup database.
    • JDBC/whatever general access is still in the future.

Note: When Imanis backups offer direct query access, the possibility will of course exist to use the backup data for general query processing. But while that kind of capability sounds great in theory, I’m not aware of it being a big deal (on technology stacks that already offer it) in practice.

The most technically notable other use cases Imanis mentioned are probably:

  • Data science dataset generation. Imanis lets you generate a partial copy of the database for analytic or test purposes.
    • You can project, select or sample your data, which suggests use of the current query capabilities.
    • There’s an API to let you mask Personally Identifiable Information by writing your own data transformations.
  • Archiving/tiering/ILM (Information Lifecycle Management). Imanis lets you divide data according to its hotness.

Imanis views its competition as:

  • Native utilities of the data stores.
  • Hand-coded scripts.
  • Datos.io, principally in the Cassandra market (so far).

Beyond those, the obvious comparison to Imanis is Delphix. I haven’t spoken with Delphix for a few years, but I believe that key differences between Delphix and Imanis start:

  • Delphix is focused on widely-installed RDBMS such as Oracle.
  • Delphix actually tries to have different production logical copies of your database run off of the same physical copy. Imanis, in contrast, offers technology to help you copy your databases quickly and effectively, but the copies you actually use will indeed be separate from each other.

Imanis software runs on its own cluster, based on hacked Hadoop. A lot of the hacking seems to relate to a metadata store, which supports things like:

  • Understanding which (incrementally backed up) blocks need to be pulled together to make a specific copy of the database.
  • Putting data in different places for ILM/tiering.
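As a way to picture that first metadata-store job, here’s a toy sketch of incremental backup plus point-in-time restore via content-addressed blocks and per-backup manifests. It’s my own illustration of the general technique, not Imanis’ design:

```python
# Toy sketch of incremental backup + point-in-time restore.
# Each backup stores only blocks it hasn't seen before (keyed by content hash);
# a per-backup manifest records which blocks make up each point-in-time copy.
import hashlib

block_store = {}   # content hash -> block bytes (written once, shared across backups)
manifests = {}     # backup timestamp -> ordered list of block hashes

def backup(timestamp, blocks):
    manifest = []
    for block in blocks:
        h = hashlib.sha256(block).hexdigest()
        block_store.setdefault(h, block)   # incremental: store only new blocks
        manifest.append(h)
    manifests[timestamp] = manifest

def restore(timestamp):
    # Pull together whichever blocks -- old or new -- this point in time needs.
    return b"".join(block_store[h] for h in manifests[timestamp])

backup("t1", [b"alpha", b"bravo"])
backup("t2", [b"alpha", b"charlie"])       # only b"charlie" is newly stored
assert restore("t1") == b"alphabravo"
assert restore("t2") == b"alphacharlie"
```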

Another piece of Imanis tech is machine-learning-based anomaly detection.

  • As incrementally backed-up blocks arrive, Imanis flags anomalous ones, giving a reason for each flag.
  • You can denounce a flag as a false alert, and hopefully similar flags won’t be raised in the future.

The technology for this seems rather basic:

  • Random forests for the flagging.
  • No drilldown w/in the Imanis system for follow-up.

But in general concept this is something a lot more systems should be doing.
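For illustration only, here’s a generic flag-and-veto loop of that sort. I’ve used scikit-learn’s IsolationForest as a stand-in for the flagging model — per the above, Imanis’ own flagging uses random forests — and reduced the feedback mechanism to a crude whitelist:

```python
# Generic sketch of flag-plus-feedback anomaly detection (not Imanis' code).
# IsolationForest stands in for the flagging model; vetoed flags go on a
# whitelist so similar items aren't flagged again.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_blocks = rng.normal(loc=100, scale=5, size=(500, 3))   # e.g. size, entropy, churn
model = IsolationForest(random_state=0).fit(normal_blocks)

def flag(block_features, whitelist):
    if any(np.allclose(block_features, w, atol=5.0) for w in whitelist):
        return False                        # user previously denounced similar flags
    return model.predict([block_features])[0] == -1   # -1 means anomalous

whitelist = []
suspicious = np.array([100, 60, 100])       # oddball block
if flag(suspicious, whitelist):
    print("flagged")                        # a human can now denounce it...
    whitelist.append(suspicious)            # ...suppressing similar future flags
```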

Most of the rest of Imanis’ tech story is straightforward — support various alternatives for computing platforms, offer the usual security choices, etc. One exception that was new to me was the use of erasure codes, which seem to be a generalization of the concept of parity bits. Allegedly, when used in a storage context these have the near-magical property of offering 4X replication safety with only a 1.5X expansion of data volume. I won’t claim to have understood the subject well enough to see how that could make sense, or what tradeoffs it would entail.
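While I won’t vouch for the real math either, the simplest member of that family — a single XOR parity block, i.e. the parity-bit idea applied to whole blocks — is easy to show, along with the storage arithmetic behind claims like the one above (parameters here are illustrative):

```python
# Simplest erasure-code case: one XOR parity block over k data blocks lets you
# rebuild any single lost block. Real schemes (e.g. Reed-Solomon) add m parity
# blocks to survive any m losses.
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(data)
lost = data.pop(1)                          # lose a block
recovered = xor_blocks(data + [parity])     # XOR of survivors + parity rebuilds it
assert recovered == lost

# Storage arithmetic, with illustrative parameters: k=8 data + m=4 parity
# shards is a 1.5X expansion yet survives any 4 shard losses -- loosely, the
# kind of safety you'd otherwise buy with 4X replication.
k, m = 8, 4
print("expansion:", (k + m) / k, "losses survived:", m)
```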


More notes on the transition to the cloud

Thu, 2017-08-17 04:11

Last year I posted observations about the transition to the cloud. Here are some further thoughts.

0. In case any doubt remained, the big questions about transitioning to the cloud are “When?” and “How?”. “Whether”, by way of contrast, is pretty much settled.

1. The answer to “When?” is generally “Over many years”. In particular, at most enterprises the cloud transition will span multiple CIOs’ tenures.

Few enterprises will ever execute on simple, consistent, unchanging “cloud strategies”.

2. The SaaS (Software as a Service) vs. on-premises tradeoffs are being reargued, except that proponents now spell SaaS C-L-O-U-D. (Ali Ghodsi of Databricks made a particularly energetic version of that case in a recent meeting.)

3. In most countries (at least in the US and the rest of the West), the cloud vendors deemed to matter are Amazon, followed by Microsoft, followed by Google. Even so, when it comes to the public cloud, Microsoft is much, much more enterprise-savvy than its key competitors.

4. Another non-technical competitive factor: Wal-Mart isn’t the only huge company that is hostile to the Amazon cloud because of competition with other Amazon businesses.

5. It was once thought that in many small countries around the world, there would be OpenStack-based “national champion” cloud winners, perhaps as subsidiaries of the leading telecom vendors. This doesn’t seem to be happening.

Even so, some of the larger managed-economy and/or generally authoritarian countries will have one or more “national champion” cloud winners each — surely China, presumably Russia, obviously Iran, and probably some others as well.

6. While OpenStack in general seems to have fizzled, S3 compatibility has momentum.

7. Finally, let’s return to our opening points: The cloud transition will happen, but it will take considerable time. A principal reason for slowness is that, as a general rule, apps aren’t migrated to platforms directly; rather, they get replaced by new apps on new platforms when the time is right for them to be phased out anyway.

However, there’s a codicil to those generalities — in some cases it’s easier to migrate to the new platform than in others. The hardest migration was probably when the rise of RDBMS, the shift from mainframes to UNIX and the switch to client/server all happened at once; just about nothing got ported from the old platforms to the new. Easier migrations included:

  • The switch from Unix to Linux. They were very similar.
  • The adoption of virtualization. A major purpose of the technology was to make migration easy.
  • The initial adoption of DBMS. Then-legacy apps relied on flat file systems, which DBMS often found easy to emulate.

The cloud transition is somewhere in the middle between those extremes. On the “easy” side:

  • Popular database management technologies and so on are available in the cloud just as they are on-premises.
  • Major app vendors are doing the hard work of cloud ports themselves.

Nonetheless, the public cloud is in many ways a whole new computing environment — and so for the most part, customer-built apps will prove too difficult to migrate. Hence my belief that overall migration to the cloud will be very incremental.


Notes on data security

Thu, 2017-08-10 04:15

1. In June I wrote about burgeoning interest in data security. I’d now like to add:

  • Even more than I previously thought, demand seems to be driven largely by issues of regulatory compliance.
  • In an exception to that general rule, many enterprises have vague mandates for data encryption.
  • In awkward contradiction to that general rule, there’s a general sense that it’s just security’s “turn” to be a differentiating feature, since various other “enterprise” needs are already being well-addressed.

We can reconcile these anecdata pretty well if we postulate that:

  • Enterprises generally agree that data security is an important need.
  • Exactly how they meet this need depends upon what regulators choose to require.

2. My current impressions of the legal privacy vs. surveillance tradeoffs are basically:

  • The freer non-English-speaking countries are more concerned about ensuring data privacy. In particular, the European Union’s upcoming GDPR (General Data Protection Regulation) seems like a massive addition to the compliance challenge.
  • The “Five Eyes” (US, UK, Canada, Australia, New Zealand) are more concerned about maintaining the efficacy of surveillance.
  • Authoritarian countries, of course, emphasize surveillance as well.

3. Multiple people have told me that security concerns include (data) lineage and (data) governance as well. I’m fairly OK with that conflation.

  • By citing “lineage” I think they’re referring to the point that if you don’t know where data came from, you don’t know if it’s trustworthy. This fits well with standard uses of the “data lineage” term.
  • By “data governance” they seem to mean policies and procedures to limit the chance of unauthorized or uncontrolled data change, or technology to support those policies. Calling that “data governance” is a bit of a stretch, but it’s not so ridiculous that we need to make a big fuss about it.

In other words: If your data transformation pipelines aren’t locked down, then your data isn’t locked down either.

4. But how seriously does that last point need to be taken? For starters, the possibility of erroneous calculations:

  • Is a strong threat to analytic accuracy, as has been recognized at least for the decades that “one version of the truth” has been a catchphrase.
  • Has some regulatory risk, e.g. in the United States around Sarbanes-Oxley.
  • Is not as big a deal for the core security threat of data theft/exfiltration.

Further, it’s not too hard architecturally to have a divide between:

  • Data transformation for operational use cases, which may need to be locked down.
  • Data transformation for purely investigative analytics, which can be very fluid, for transformation technologies such as Hadoop, Spark and Excel alike.

Bottom line: Data transformation security is an accessible must-have in some use cases, but an impractical nice-to-have in others.


Analytics on the edge?

Fri, 2017-06-30 03:27

There’s a theory going around to the effect that:

  • Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
  • Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
  • Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.

There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.

1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.

2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.

3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration. 

There are any number of situations in which decisions are made on or about remote systems, based on models or rules that should be improved over time. For example, such decisions might be made in:

  • Machine vision or other “recognition”-oriented areas of AI.
  • Detection or prediction of malfunctions.
  • Choices as to what data is significant enough to ship back upstream.

In the canonical case, we might envision a system in which:

  • Huge amounts of data are collected and are used to make real-time decisions.
  • The models are trained centrally, and updated remotely over time as they are improved.
  • The remote systems can only ship back selected or aggregated data to help train the models.

This all seems like an awkward fit for any common computing architecture I can think of.
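For concreteness, the remote half of that envisioned system might be sketched as below. Every name is hypothetical, and the hard parts — the central trainer, the scorekeeping — are exactly what’s left out:

```python
# Hypothetical sketch of the remote half of the "canonical case":
# score locally in real time, ship back only selected/aggregated data,
# and accept model updates pushed from a central trainer.
import statistics

class EdgeNode:
    def __init__(self, model):
        self.model = model          # trained centrally, updated remotely
        self.outbox = []            # selected data to ship upstream

    def update_model(self, new_model):      # pushed by the central servers
        self.model = new_model

    def handle(self, reading):
        decision = self.model(reading)      # real-time local decision
        if decision == "anomalous":
            self.outbox.append(reading)     # ship back only what's interesting
        return decision

    def aggregates(self):
        # Periodic summary for central training, instead of raw data.
        return {"n": len(self.outbox),
                "mean": statistics.fmean(self.outbox) if self.outbox else None}

node = EdgeNode(model=lambda r: "anomalous" if r > 100 else "normal")
for r in [12, 37, 140, 55]:
    node.handle(r)
print(node.aggregates())            # {'n': 1, 'mean': 140.0}
```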

But it’s hard to pin down important examples of that “canonical” case. The story implicitly assumes:

  • A model is widely deployed.
  • The model does a decent job but not a perfect one.
  • Based on its successes and failures, the model gets improved.

And now we’re begging a huge question: What exactly is there that keeps score as to when the model succeeds and fails? Mathematically speaking, I can’t imagine what a general answer would be like.

4. So when it comes to predictive models executed on real-world appliances I think that analytic workflows will:

  • Differ for different (categories of) applications.
  • Rely in most cases on simple patterns of data movement, such as:
    • Stream everything to central servers and sort it out there, or if that’s not workable …
    • … instrument a limited number of test nodes to store everything, and recover the data in batch for analysis.
    • Update models only in timeframes when you’re doing a full app update/refresh.

And with that, much of the apparent need for fancy distributed analytic architectures evaporates.

5. Finally, and notwithstanding the previous point: Across many use cases, there’s some kind of remote log data being shipped back to a central location. It may be the complete log. It may be periodic aggregates. It may include only what the edge nodes regard as significant events. But something is getting shipped home.

The architectures for shipping, receiving and analyzing such data are in many cases immature. That’s obvious if there’s any kind of streaming involved, or if analysis is done in Spark. Ditto if there’s anything we might call “non-tabular business intelligence”. As this stuff matures, it will in many cases fit very well with today’s cloud thinking. But in any case — it needs to mature.

Truth be told, even the relational case is immature, in that it can easily rely on what I called:

data warehouses (perhaps really data marts) that are updated in human real-time

That quote is from a recent post about Kudu, which:

  • Is designed for exactly that use case.
  • Went GA early this year.

As always, technology is in flux.


Generally available Kudu

Fri, 2017-06-16 10:52

I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:

  • Security is an ever bigger deal.
  • There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
    • Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
    • Flash is often — but not yet always — preferred over disk for that kind of use.
    • Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliances (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share), rather than from general-purpose DBMS such as Oracle or SQL Server.
  • Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.

Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:

  • A data storage system introduced by Cloudera (and subsequently open-sourced).
  • Columnar.
  • Updatable in human real-time.
  • Meant to serve as the data storage tier for Impala and Spark.

Kudu’s adoption and roll-out story starts:

  • Kudu went to general availability on January 31. I gather this spawned an uptick in trial activity.
  • A subsequent release with some basic security features spawned another uptick.
  • I don’t think Cloudera will mind my saying that there are many hundreds of active Kudu clusters.
  • But Cloudera believes that, this soon after GA, very few Kudu users are in actual production.

Early Kudu interest is focused on 2-3 kinds of use case. The biggest is the kind of “data warehousing” highlighted above. Cloudera characterizes the others by the kinds of data stored, specifically the overlapping categories of time series — including financial trading — and machine-generated data. A lot of early Kudu use is with Spark, even ahead of (or in conjunction with) Impala. A small amount has no relational front-end at all.

Other notes on Kudu include:

  • Solid-state storage is recommended, with a few terabytes per node.
  • You can also use spinning disk. If you do, your write-ahead logs can still go to flash.
  • Cloudera said Kudu compression ratios can be as low as 2-5X, or as high as 10-20X. With that broad a range, I didn’t drill down into specifics of what they meant.
  • There seem to be a number of Kudu clusters with 50+ nodes each. By way of contrast, a “typical” Cloudera customer has 100s of nodes overall.
  • As you might imagine from their newness, Kudu security features — Kerberos-based — are at the database level rather than anything more granular.

And finally, the Cloudera folks woke me up to some issues around streaming data ingest. If you stream data in, there will be retries resulting in duplicate delivery. So your system needs to deal with those one way or another. Kudu’s way is:

  • Primary keys will be unique. (Note: This is not obvious in a system that isn’t an entire RDBMS in itself.)
  • You can configure the uniqueness to be guaranteed either through an upsert mechanism (sketched below) or simply by rejecting duplicates.
  • Alternatively, you can write code to handle duplication errors, e.g. via Spark.
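To make the upsert option concrete, here’s a minimal sketch using the kudu-python client; the host and table names are made up, and this is an illustration rather than recommended ingest code:

```python
# Minimal sketch: absorbing duplicate deliveries via Kudu upserts.
# Re-applying the same row (same primary key) just rewrites it, so retries
# from the ingest pipeline are harmless. Host/table names are hypothetical.
import kudu

client = kudu.connect(host='kudu-master.example.com', port=7051)
table = client.table('events')
session = client.new_session()

for row in [{'event_id': 42, 'payload': 'hello'},
            {'event_id': 42, 'payload': 'hello'}]:   # duplicate delivery
    session.apply(table.new_upsert(row))             # second apply is a no-op rewrite

session.flush()
```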

The data security mess

Wed, 2017-06-14 08:21

A large fraction of my briefings this year have included a focus on data security. This is the first year in the past 35 that that’s been true.* I believe that reasons for this trend include:

  • Security is an important aspect of being “enterprise-grade”. Other important checkboxes have been largely filled in. Now it’s security’s turn.
  • A major platform shift, namely to the cloud, is underway or at least being planned for. Security is an important thing to think about as that happens.
  • The cloud even aside, technology trends have created new ways to lose data, which security technology needs to address.
  • Traditionally paranoid industries are still paranoid.
  • Other industries are newly (and rightfully) terrified of exposing customer data.
  • My clients at Cloudera thought they had a chance to get significant messaging leverage from emphasizing security. So far, it seems that they were correct.

*Not really an exception: I did once make it a project to learn about classic network security, including firewall appliances and so on.

Certain security requirements, desires or features keep coming up. These include (and as in many of my lists, these overlap):

  • Easy, comprehensive access control. More on this below.
  • Encryption. If other forms of security were perfect, encryption would never be needed. But they’re not.
  • Auditing. Ideally, auditing can alert you to trouble before (much) damage is done. If not, then it can at least help you do proactive damage control in the face of breach.
  • Whatever regulators mandate.
  • Whatever is generally regarded as best practices. Security “best practices” generally keep enterprises out of legal and regulatory trouble, or at least minimize same. They also keep employees out of legal and career trouble, or minimize same. Hopefully, they even keep data safe.
  • Whatever the government is known to use. This is a common proxy for “best practices”.

More specific or extreme requirements come up as well. I don’t know how widely those kinds of requirements will spread.

The most confusing part of all this may be access control.

  • Security has a concept called AAA, standing for Authentication, Authorization and Accounting/Auditing/Other things that start with “A”. Yes — even the core acronym in this area is ill-defined.
  • The new standard for authentication is Kerberos. Or maybe it’s SAML (Security Assertion Markup Language). But SAML is actually an old, now-fragmented standard. But it’s also particularly popular in new, cloud use cases. And Kerberos is actually even older than SAML.
  • Suppose we want to deny somebody authorization to access certain raw data, but let them see certain aggregated or derived information. How can we be sure they can’t really see the forbidden underlying data, except through a case-by-case analysis? And if that case-by-case analysis is needed, how can the authorization rules ever be simple?
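A tiny example of why that case-by-case analysis is genuinely needed — the classic “differencing attack,” in which two individually-permitted aggregates reveal a forbidden row:

```python
# Two aggregate queries, each individually "safe", leak one employee's salary.
salaries = {"alice": 110_000, "bob": 95_000, "carol": 120_000}

q1 = sum(salaries.values())                                  # SUM over everyone
q2 = sum(v for k, v in salaries.items() if k != "carol")     # SUM excluding carol

print("carol's salary:", q1 - q2)   # 120000 -- raw data recovered from aggregates
```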

Further confusing matters, it is an extremely common analytic practice to extract data from somewhere and put it somewhere else to be analyzed. Such extracts are an obvious vector for data breaches, especially when the target system is managed by an individual or IT-weak department. Excel-on-laptops is probably the worst case, but even fat-client BI — both QlikView and Tableau are commonly used with local in-memory data staging — can present substantial security risks. To limit such risks, IT departments are trying to impose new standards and controls on departmental analytics. But IT has been fighting that war for many decades, and it hasn’t won yet.

And that’s all when data is controlled by a single enterprise. Inter-enterprise data sharing confuses things even more. For example, national security breaches in the US tend to come from government contractors more than government employees. (Ed Snowden is the most famous example. Chelsea Manning is the most famous exception.) And as was already acknowledged above, even putting your data under control of a SaaS vendor opens hard-to-plug security holes.

Data security is a real mess.


Light-touch managed services

Wed, 2017-06-14 08:14

Cloudera recently introduced Cloudera Altus, a Hadoop-in-the-cloud offering with an interesting processing model:

  • Altus manages jobs for you.
  • But you actually run them on your own cluster, and so you never have to put your data under Altus’ control.

Thus, you avoid a potential security risk (shipping your data to Cloudera’s service). I’ve tentatively named this strategy light-touch managed services, and am interested in exploring how broadly applicable it might or might not be.

For light-touch to be a good approach, there should be (sufficiently) little downside in performance, reliability and so on from having your service not actually control the data. That assumption is trivially satisfied in the case of Cloudera Altus, because it’s not an ordinary kind of app; rather, its whole function is to improve the job-running part of your stack. Most kinds of apps, however, want to operate on your data directly. For those, it is more challenging to meet acceptable SLAs (Service-Level Agreements) on a light-touch basis.

Let’s back up and consider what “light-touch” for data-interacting apps (i.e., almost all apps) would actually mean. The basics are: 

  • The user has some kind of environment that manages data and executes programs.
  • The light-touch service, running outside this environment, spawns one or more app processes inside it.
  • Useful work ensues …
  • … with acceptable reliability and performance.
  • The environment’s security guarantees ensure that data doesn’t leak out.
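Reduced to toy code, that shape might look like the following — wholly hypothetical classes, standing in for a real control plane and a real cluster:

```python
# Hypothetical sketch of the light-touch shape: the managed service holds only
# job definitions and control-plane state; data stays inside the customer's
# environment, which merely accepts "run this job" requests.
class CustomerEnvironment:
    """Customer-controlled cluster: owns the data, executes submitted jobs."""
    def __init__(self, data):
        self._data = data                   # never leaves this object

    def run_job(self, job):
        result_rows = job["transform"](self._data)
        return {"status": "ok", "rows_out": len(result_rows)}   # metadata only

class LightTouchService:
    """Outside service: spawns jobs in the customer environment, sees no data."""
    def submit(self, env, job):
        receipt = env.run_job(job)          # control signal in, metadata out
        return receipt

env = CustomerEnvironment(data=[1, 2, 3, 4])
svc = LightTouchService()
print(svc.submit(env, {"transform": lambda rows: [r * 2 for r in rows]}))
```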

Cases where that doesn’t even make sense include but are not limited to:

  • Transaction-processing applications that are carefully tuned for efficient database access.
  • Applications that need to be carefully installed on or in connection with a particular server, DBMS, app server or whatever.

On the other hand:

  • A light-touch service is at least somewhat reasonable in connection with analytics-oriented data-management-plus-processing environments such as Hadoop/Spark clusters.
  • There are many workloads over Hadoop clusters that don’t need efficient database access. (Otherwise Hive use would not be so prevalent.)
  • Light-touch efforts seem more likely to be helped than hurt by abstraction environments such as the public cloud.

So we can imagine some kind of outside service that spawns analytic jobs to be run on your preferred — perhaps cloudy — Hadoop/Spark cluster. That could be a safe way to get analytics done over data that really, really, really shouldn’t be allowed to leak.

But before we anoint light-touch managed services as the NBT (Next Big Thing/Newest Bright Thought), there’s one more hurdle for it to overcome — why bother at all? What would a light-touch managed service provide that you wouldn’t also get from installing packaged software onto your cluster and running it in the usual way? The simplest answer is “The benefits of SaaS (Software as a Service)”, and so we can rephrase the challenge as “Which benefits of SaaS still apply in the light-touch managed service scenario?”

The vendor perspective might start, with special cases such as Cloudera Altus excepted:

  • The cost-saving benefits of multi-tenancy mostly don’t apply. Each instance winds up running on a separate cluster, namely the customer’s own. (But that’s likely to be SaaS/cloud itself.)
  • The benefits of controlling your execution environment apply at best in part. You may be able to assume the customer’s core cluster is through some cloud service, but you don’t get to run the operation yourself.
  • The benefits of a SaaS-like product release cycle do mainly apply.
    • Only having to support the current version(s) of the product is a little limited when you don’t wholly control your execution environment.
    • Light-touch doesn’t seem to interfere with the traditional SaaS approach of a rapid, incremental product release cycle.

When we flip to the user perspective, however, the idea looks a little better.

Bottom line: Light-touch managed services are well worth thinking about. But they’re not likely to be a big deal soon.


Cloudera Altus

Wed, 2017-06-14 08:12

I talked with Cloudera before the recent release of Altus. In simplest terms, Cloudera’s cloud strategy aspires to:

  • Provide all the important advantages of on-premises Cloudera.
  • Provide all the important advantages of native cloud offerings such as Amazon EMR (Elastic MapReduce), or at least come sufficiently close to that goal.
  • Benefit from customers’ desire to have on-premises and cloud deployments that work:
    • Alike in any case.
    • Together, to the extent that that makes use-case sense.

In other words, Cloudera is porting its software to an important new platform.* And this port isn’t complete yet, in that Altus is geared only for certain workloads. Specifically, Altus is focused on “data pipelines”, aka data transformation, aka “data processing”, aka new-age ETL (Extract/Transform/Load). (Other kinds of workload are on the roadmap, including several different styles of Impala use.) So what about that is particularly interesting? Well, let’s drill down.

*Or, if you prefer, improving on early versions of the port.

Since so much of the Hadoop and Spark stacks is open source, competition often isn’t based on core product architecture or features, but rather on factors such as:

  • Ease of management. This one is nuanced in the case of cloud/Altus. For starters:
    • One of Cloudera’s main areas of differentiation has always been Cloudera Manager.
    • Cloudera Director was Cloudera’s first foray into cloud-specific management.
    • Cloudera Altus features easier/simpler management than Cloudera Director, meant to be analogous to native Amazon management tools, and good-enough for use cases that don’t require strenuous optimization.
    • Cloudera Altus also includes an optional workload analyzer, in slight conflict with other parts of the Altus story. More on that below.
  • Ease of development. Frankly, this rarely seems to come up as a differentiator in the Hadoop/Spark world, various “notebook” offerings such as Databricks’ or Cloudera’s notwithstanding.
  • Price. When price is the major determinant, Cloudera is sad.
  • Open source purity. Ditto. But at most enterprises — at least those with hefty IT budgets — emphasis on open source purity either is a proxy for price shopping, or else boils down to largely bogus concerns about vendor lock-in.

Of course, “core” kinds of considerations are present to some extent too, including:

  • Performance, concurrency, etc. I no longer hear many allegations of differences in across-the-board Hadoop performance. But the subject does arise in specific areas, most obviously in analytic SQL processing. It arises in the case of Altus as well, in that Cloudera improved in a couple of areas that it concedes were previously Amazon EMR advantages, namely:
    • Interacting with S3 data stores.
    • Spinning instances up and down.
  • Reliability and data safety. Cloudera mentioned that it did some work so as to be comfortable with S3’s eventual consistency model.

Recently, Cloudera has succeeded at blowing security up into a major competitive consideration. Of course, they’re trying that with Altus as well. Much of the Cloudera Altus story is the usual — rah-rah Cloudera security, Sentry, Kerberos everywhere, etc. But there’s one aspect that I find to be simple yet really interesting:

  • Cloudera Altus doesn’t manage data for you.
  • Rather, it launches and manages jobs on a separate Hadoop cluster.

Thus, there are very few new security risks to running Cloudera Altus, beyond whatever risks are inherent to running any version of Hadoop in the public cloud.

Where things get a bit more complicated is some features for workload analysis.

  • Cloudera recently introduced some capabilities for on-the-fly trouble-shooting. That’s fine.
  • Cloudera has also now announced an offline workload analyzer, which compares actual metrics computed from your log files to “normal” ones from well-running jobs. For that, you really do have to ship information to a separate cluster managed by Cloudera.

The information shipped is logs rather than actual query results or raw data. In theory, an attacker who had all those logs could conceivably make inferences about the data itself; but in practice, that doesn’t seem like an important security risk at all.

So is this an odd situation where that strategy works, or could what we might call light-touch managed services turn out to be widespread and important? That’s a good question to address in a separate post.


Interana

Mon, 2017-04-17 05:10

Interana has an interesting story, in technology and business model alike. For starters:

  • Interana does ad-hoc event series analytics, which they call “interactive behavioral analytics solutions”.
  • Interana has a full-stack analytic offering, including:
    • Its own columnar DBMS …
    • … which has a non-SQL DML (Data Manipulation Language) meant to handle event series a lot more fluently than SQL does, but which the user is never expected to learn because …
    • … there also are BI-like visual analytics tools that support plenty of drilldown.
  • Interana sells all this to “product” departments rather than marketing, because marketing doesn’t sufficiently value Interana’s ad-hoc query flexibility.
  • Interana boasts >40 customers, with annual subscription fees ranging from high 5 figures to low 7 figures.

And to be clear — if we leave aside any questions of marketing-name sizzle, this really is business intelligence. The closest Interana comes to helping with predictive modeling is giving its ad-hoc users inspiration as to where they should focus their modeling attention.

Interana also has an interesting twist in its business model, which I hope can be used successfully by other enterprise software startups as well.

  • For now, at no extra charge, Interana will operate its software for you as a managed service. (A majority of Interana’s clients run the software on Amazon or Azure, where that kind of offering makes sense.)
  • However, presumably in connection with greater confidence in its software’s ease of administration, Interana will move this year toward unbundling the service as an extra-charge offering on top of the software itself.

The key to understanding Interana is its DML. Notes on that include:

  • Interana’s DML is focused on path analytics …
    • … but Interana doesn’t like to use that phrase because it sounds too math-y and difficult.
    • Interana may be the first company that’s ever told me it’s focused on providing a better nPath. :)
  • Primitives in Interana’s language — notwithstanding the company’s claim that it never ever intended to sell to marketing departments — include familiar web analytics concepts such as “session”, “funnel” and so on. (However, these are being renamed to more neutral terms such as “flow” in an upcoming version of the product.)
  • As typical example questions or analytic subjects, Interana offered:
    • “Which are the most common products in shopping carts where time-to-checkout was greater than 30 minutes?”
    • “Exactly which steps in the onboarding process result in the greatest user frustration?”
  • The Interana folks and I agree that Splunk is the most recent example of a new DML kicking off a significant company.
  • The most recent example I can think of in which a vendor hung its hat on a new DML that was a “visual programming language” is StreamBase, with EventFlow. That didn’t go all that well.
  • To use Founder/CTO Bobby Johnson’s summary term, the real goal of the Interana language is to describe a state machine, specifically one that produces (sets of) sequences of events (and the elapsed time between them).

Notes on Interana speeds & feeds include:

  • Interana only promises data freshness up to micro-batch latencies — i.e., a few minutes. (Obviously, this shuts them out of most networking monitoring and devops use cases.)
  • Interana thinks it’s very important for query response time to max out at a low number of seconds. If necessary, the software will return approximate results rather than exact ones so as to meet this standard.
  • Interana installations and workloads to date have gotten as large as:
    • 1-200 nodes.
    • Trillions of rows, equating to 100s of TBs of data after compression/ >1 PB uncompressed.
    • Billions of rows/events received per day.
    • 100s of 1000s of (very sparse) columns.
    • 1000s of named users.

Although Interana’s original design point was spinning disk, most customers store their Interana data on flash.

Interana architecture choices include:

  • They’re serious about micro-batching.
    • If the user’s data is naturally micro-batched — e.g. a new S3 bucket every few minutes — Interana works with that.
    • Even if the customer’s data is streamed — e.g. via Kafka — Interana insists on micro-batching it.
  • They’re casual about schemas.
    • Interana assumes data arrives with some kind of recognizable structure, via JSON, CSV or whatever.
      • Interana observes, correctly, that log data often is decently structured.
        • For example, if you’re receiving “phone home” pings from products you originally manufactured, you know what data structures to expect.
        • Interana calls this “logging with intent”.
      • Interana is fine with a certain amount of JSON (for example) schema change over time.
      • If your arriving data truly is a mess, then you need to calm it down via a pass through Splunk or whatever before sending it to Interana.
    • JSON hierarchies turn into multi-part column names in the usual way.
    • Interana supports one level of true nesting, and one level only; column values can be “lists”, but list values can’t be lists themselves.
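
For illustration, here is a small Python sketch of that flattening. The delimiter choice and the exact handling of lists are my assumptions, not confirmed details of Interana's implementation.

    def flatten(doc, prefix=""):
        # Turn nested JSON objects into dotted, multi-part column names.
        cols = {}
        for key, value in doc.items():
            name = prefix + key
            if isinstance(value, dict):
                cols.update(flatten(value, prefix=name + "."))
            elif isinstance(value, list):
                # One level of nesting only: list values may not be lists.
                if any(isinstance(v, list) for v in value):
                    raise ValueError(name + ": lists of lists not supported")
                cols[name] = value
            else:
                cols[name] = value
        return cols

    doc = {"user": {"id": 7, "tags": ["beta", "mobile"]}, "event": "click"}
    print(flatten(doc))
    # {'user.id': 7, 'user.tags': ['beta', 'mobile'], 'event': 'click'}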

Finally, other Interana tech notes include:

  • Compression is a central design consideration …
    • … especially but not only compression algorithms designed to deal with great sparseness, such as run-length encoding (RLE).
    • Dictionary compression, in a strategy that is rarer than I once expected it to be, uses a global rather than shard-by-shard dictionary. The data Interana expects is of low-enough cardinality for this to be the better choice.
    • Column data is sorted. A big part of the reason is of course to aid compression. (A toy sketch of run-length encoding on sorted data follows this list.)
    • Compression strategies are chosen automatically for each segment. Wholly automatically, I gather; you can’t tune the choice manually.
  • As you would think, Interana technically includes multiple data stores.
    • Data first hits a write-optimized store. Unlike Vertica’s, this WOS is never involved in answering queries.
    • Asynchronously, the data is broken into columns, and banged to “disk”.
    • Asynchronously again, the data is sorted.
    • Queries run against sorted data, sorting recent blocks on-the-fly if necessary.
  • Interana lets you shard different replicas of the data according to different shard keys.
  • Interana is proud of the random sampling it does when serving approximate query results.
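
Since run-length encoding came up above, here is a toy Python sketch of it, which also shows why sorted column data compresses so well: sorting maximizes run lengths.

    from itertools import groupby

    def rle_encode(column):
        # Encode a column as (value, run_length) pairs.
        return [(v, sum(1 for _ in run)) for v, run in groupby(column)]

    unsorted_col = ["US", "DE", "US", "DE", "US", "DE"]
    print(rle_encode(unsorted_col))           # six pairs, no compression
    print(rle_encode(sorted(unsorted_col)))   # [('DE', 3), ('US', 3)]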

Analyzing the right data

Thu, 2017-04-13 07:05

0. A huge fraction of what’s important in analytics amounts to making sure that you are analyzing the right data. To a large extent, “the right data” means “the right subset of your data”.

1. In line with that theme:

  • Relational query languages, at their core, subset data. Yes, they all also do arithmetic, and many do more math or other processing than just that. But it all starts with set theory.
  • Underscoring the power of this approach, other data architectures over which analytics is done usually wind up with SQL or “SQL-like” language access as well.

2. Business intelligence interfaces today don’t look that different from what we had in the 1980s or 1990s. The biggest visible* changes, in my opinion, have been in the realm of better drilldown, a la QlikView and then Tableau. Drilldown, of course, is the main UI for business analysts and end users to subset data themselves.

*I used the word “visible” on purpose. The advances at the back end have been enormous, and much of that redounds to the benefit of BI.

3. I wrote 2 1/2 years ago that sophisticated predictive modeling commonly fit the template:

  • Divide your data into clusters.
  • Model each cluster separately.

That continues to be tough work. Attempts to productize shortcuts have not caught fire.
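
For concreteness, here is a minimal scikit-learn sketch of that cluster-then-model template. The data, cluster count and model choice are purely illustrative.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 3))                     # 300 rows, 3 features
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

    # Step 1: divide the data into clusters.
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    # Step 2: model each cluster separately.
    models = {}
    for k in range(4):
        mask = labels == k
        models[k] = LinearRegression().fit(X[mask], y[mask])

    # To score a new row, first assign it to a cluster, then apply that
    # cluster's model.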

4. In an example of the previous point, anomaly management technology can, in theory, help shortcut any type of analytics, in that it tries to identify what parts of your data to focus on (and why). But it’s in its early days; none of the approaches to general anomaly management has gained much traction.

5. Marketers have vast amounts of information about us. It starts with every credit card transaction line item and a whole lot of web clicks. But it’s not clear how many of those (10s of) thousands of columns of data they actually use.

6. In some cases, the “right” amount of data to use may actually be tiny. Indeed, some statisticians claim that fewer than 10 data points may be enough to get a good model. I’m skeptical, at least as to the practical significance of such extreme figures. But on the more plausible side — if you’re hunting bad guys, it may not take very many separate facts before you have good evidence of collusion or fraud.

Internet fraud excepted, of course. Identifying that usually involves sifting through a lot of log entries.

7. All the needle-hunting in the world won’t help you unless what you seek is in the haystack somewhere.

  • Often, enterprises explicitly invest in getting more data.
  • Keeping everything you already generate is the obvious choice for most categories of data, but some of the lowest-value-per-bit logs may forever be thrown away.

8. Google is famously in the camp that there’s no such thing as too much data to analyze. For example, it uses >500 “signals” in judging the quality of potential search results. I don’t know how many separate data sources those signals are informed by, but surely there are a lot.

9. Few predictive modeling users demonstrate a need for vast data scaling. My support for that claim is a lot of anecdata. In particular:

  • Some predictive modeling techniques scale well. Some scale poorly. The level of pain around the “scale poorly” aspects of that seems to be fairly light (or “moderate” at worst). For example:
    • In the previous technology generation, analytic DBMS and data warehouse appliance vendors tried hard to make statistical packages scale across their systems. Success was limited. Nobody seemed terribly upset.
    • Cloudera’s Data Science Workbench messaging isn’t really scaling-centric.
  • Spark’s success in machine learning is rather rarely portrayed as centering on scaling. And even when it is, Spark basically runs in memory, so each Spark node isn’t processing all that much data.

10. Somewhere in this post — i.e. right here :) — let’s acknowledge that the right data to analyze may not be exactly what was initially stored. Data munging/wrangling/cleaning/preparation is often a big deal. Complicated forms of derived data can be important too.

11. Let’s also mention data marts. Basically, data marts subset and copy data, either because the data is easier to analyze in its copied form, or to separate workloads between the original and the copied data store.

  • If we assume the data is on spinning disks or even flash, then the need for that strategy declined long ago.
  • Suppose you want to keep data entirely in memory? Then you might indeed want to subset-and-copy it. But with so many memory-centric systems doing decent jobs of persistent storage too, there’s often a viable whole-dataset management alternative.

But notwithstanding the foregoing:

  • Security/access control can be a good reason for subset-and-copy.
  • So can other kinds of administrative simplification.

12. So what does this all suggest going forward? I believe:

  • Drilldown is and will remain central to BI. If your BI doesn’t support robust drilldown, you’re doing it wrong. “Real-time” use cases are not exceptions to this rule.
  • In a strong overlap with the previous point, drilldown is and will remain central to monitoring. Whatever monitoring means to you, the ability to pinpoint the specific source of interesting signals is crucial.
  • The previous point can be recast as saying that it’s crucial to identify, isolate and explain anomalies. Some version(s) of anomaly management will become a big deal.
  • SQL and “SQL-like” languages will remain integral to analytic processing for a long time.
  • Memory-centric analytic frameworks such as Spark will continue to win. The data size constraints imposed by memory-centric processing will rarely cause difficulties.


Monitoring

Sun, 2017-03-26 06:16

A huge fraction of analytics is about monitoring. People rarely want to frame things in those terms; evidently they think “monitoring” sounds boring or uncool. One cost of that silence is that it’s hard to get good discussions going about how monitoring should be done. But I’m going to try anyway, yet again. :)

Business intelligence is largely about monitoring, and the same was true of predecessor technologies such as green paper reports or even pre-computer techniques. Two of the top uses of reporting technology can be squarely described as monitoring, namely:

  • Watching whether trends are continuing or not.
  • Seeing if there are any events — actual or impending as the case may be — that call for response, in areas such as:
    • Machine breakages (computers and physical machinery alike).
    • Resource shortfalls (e.g. various senses of “inventory”).

Yes, monitoring-oriented BI needs investigative drilldown, or else it can be rather lame. Yes, purely investigative BI is very important too. But monitoring is still the heart of most BI desktop installations.

Predictive modeling is often about monitoring too. It is common to use statistics or machine learning to help you detect and diagnose problems, and many such applications have a strong monitoring element.

I.e., you’re predicting trouble before it happens, when there’s still time to head it off.

As for incident response, in areas such as security — any incident you respond to has to be noticed first. Often, it’s noticed through analytic monitoring.

Hopefully, that’s enough of a reminder to establish the great importance of analytics-based monitoring. So how can the practice be improved? At least three ways come to mind, and only one of those three is getting enough current attention.

The one that’s trendy, of course, is the bringing of analytics into “real-time”. There are many use cases that genuinely need low-latency dashboards, in areas such as remote/phone-home IoT (Internet of Things), monitoring of an enterprise’s own networks, online marketing, financial trading and so on. “One minute” is a common figure for latency, but sometimes a couple of seconds are all that can be tolerated.

I’ve posted a lot about all this in the past.

One particular feature that could help with high-speed monitoring is to meet latency constraints via approximate query results. This can be done entirely via your BI tool (e.g. Zoomdata’s “query sharpening”) or more by your DBMS/platform software (the Snappy Data folks pitched me on that approach this week).
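
To illustrate the general idea, approximate results can be as simple as aggregating over a random sample and reporting a margin of error. This is a generic sketch, not Zoomdata's or SnappyData's actual mechanism.

    import random, statistics

    def approx_mean(values, sample_size=10_000, seed=42):
        # Aggregate over a random sample instead of the full dataset,
        # and report the estimate with a ~95% margin of error.
        random.seed(seed)
        sample = random.sample(values, min(sample_size, len(values)))
        estimate = statistics.fmean(sample)
        margin = 1.96 * statistics.stdev(sample) / len(sample) ** 0.5
        return estimate, margin

    values = [random.random() for _ in range(1_000_000)]
    estimate, margin = approx_mean(values)
    print(f"mean is roughly {estimate:.4f} +/- {margin:.4f}")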

Perennially neglected, on the other hand, are opportunities for flexible, personalized analytics. (Note: There’s a lot of discussion in that link.) The best-acknowledged example may be better filters for alerting. False negatives are obviously bad, but false positives are dangerous too. At best, false positives are annoyances; but too often, alert fatigue causes your employees to disregard crucial warning signals altogether. The Gulf of Mexico oil spill disaster has been blamed on that problem. So was a fire in my own house. But acknowledgment != action; improvement in alerting is way too slow. And some other opportunities described in the link above aren’t even well-acknowledged, especially in the area of metrics customization.

Finally, there’s what could be called data anomaly monitoring. The idea is to check data for surprises as soon as it streams in, using your favorite techniques in anomaly management. Perhaps an anomaly will herald a problem in the data pipeline. Perhaps it will highlight genuinely new business information. Either way, you probably want to know about it.

David Gruzman of Nestlogic suggests numerous categories of anomaly to monitor for. (Not coincidentally, he believes that Nestlogic’s technology is a great choice for finding each of them.) Some of his examples — and I’m summarizing here — are:

  • Changes in data format, schema, or availability. For example:
    • Data can completely stop coming in from a particular source, and the receiving system might not immediately realize that. (My favorite example is the ad tech firm that accidentally stopped doing business in the whole country of Australia.)
    • A data format change might make data so unreadable it might as well not arrive.
    • A decrease in the number of approval fields might highlight a questionable change in workflow.
  • Data quality: NULLs or malformed values might increase suddenly, in particular fields and data segments. (A toy sketch of such a check follows this list.)
  • Data value distribution: This category covers a lot of cases. A few of them are:
    • A particular value is repeated implausibly often. A bug is the likely explanation.
    • E-commerce results suddenly decrease, but only from certain client technology configurations. Probably there is a bug affecting only those particular clients.
    • Clicks suddenly increase from certain client technologies. A botnet might be at work.
    • Sales suddenly increase from a particular city. Again this might be fraud — or more benignly, perhaps some local influencers have praised your offering.
    • A particular medical diagnosis becomes much more common in a particular city. Reasons can range from fraud, to a new facility for certain kinds of tests, to a genuine outbreak of disease.
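
Here is the toy Python sketch of the data quality check promised above: flag any field whose NULL rate in the latest batch jumps well above its trailing average. The thresholds are arbitrary, and real products, presumably including Nestlogic's, are far more sophisticated.

    def null_rates(batch, fields):
        # batch: list of dicts. Returns field -> fraction of NULL/missing values.
        n = len(batch)
        return {f: sum(1 for row in batch if row.get(f) is None) / n
                for f in fields}

    def null_rate_anomalies(history, latest, fields, factor=3.0, floor=0.01):
        # Flag fields whose latest NULL rate exceeds both a multiple of the
        # trailing average and an absolute floor.
        baseline = {f: sum(null_rates(b, fields)[f] for b in history) / len(history)
                    for f in fields}
        current = null_rates(latest, fields)
        return [f for f in fields if current[f] > max(factor * baseline[f], floor)]

    history = [[{"a": 1, "b": 2}] * 100]                        # past batches
    latest = [{"a": 1, "b": None}] * 40 + [{"a": 1, "b": 2}] * 60
    print(null_rate_anomalies(history, latest, ["a", "b"]))     # ['b']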

David offered yet more examples of significant anomalies, including ones that could probably only be detected via Nestlogic’s tools. But the ones I cited above can probably be found via any number of techniques — and should be, more promptly and accurately than they currently are.


Cloudera’s Data Science Workbench

Sun, 2017-03-19 19:41

0. Matt Brandwein of Cloudera briefed me on the new Cloudera Data Science Workbench. The problem it purports to solve is:

  • One way to do data science is to repeatedly jump through the hoops of working with a properly-secured Hadoop cluster. This is difficult.
  • Another way is to extract data from a Hadoop cluster onto your personal machine. This is insecure (once the data arrives) and not very parallelized.
  • A third way is needed.

Cloudera’s idea for a third way is:

  • You don’t run anything on your desktop/laptop machine except a browser.
  • The browser connects you to a Docker container that holds (and isolates) a kind of virtual desktop for you.
  • The Docker container runs on your Cloudera cluster, so connectivity-to-Hadoop and security are handled rather automagically.

In theory, that’s pure goodness … assuming that the automagic works sufficiently well. I gather that Cloudera Data Science Workbench has been beta tested by 5 large organizations and many 10s of users. We’ll see what is or isn’t missing as more customers take it for a spin.

1. Recall that Cloudera installations have 4 kinds of nodes. 3 are obvious:

  • Hadoop worker nodes.
  • Hadoop master nodes.
  • Nodes that run Cloudera Manager.

The fourth kind is the edge/gateway node. Edge nodes handle connections to the outside world, and can also run selected third-party software. They also are where Cloudera Data Science Workbench lives.

2. One point of this architecture is to let each data scientist run the languages and tools of her choice. Docker isolation is supposed to make that practical and safe.

And so we have a case of the workbench metaphor actually being accurate! While a “workbench” is commonly just an integrated set of tools, in this case it’s also a place for you to bring in and use other tools you personally like.

Surely there are some restrictions as to which tools you can use, but I didn’t ask for those to be spelled out.

3. Matt kept talking about security, to an extent I recall in almost no other analytics-oriented briefing. This had several aspects.

  • As noted above, a lot of the hassle of Hadoop-based data science relates to security.
  • As also noted above, evading the hassle by extracting data is a huge security risk. (If you lose customer data, you’re going to have a very, very bad day.)
  • According to Matt, standard uses of notebook tools such as Jupyter or Zeppelin wind up having data stored wherever code is. Cloudera’s otherwise similar notebook-style interface evidently avoids that flaw. (Presumably, if you want to see the output, you rerun the script against the data store yourself.)

4. To a first approximation, the target users of Cloudera Data Science Workbench can be characterized the same way BI-oriented business analysts are. They’re people with:

  • Sufficiently good quantitative skills to do the analysis.
  • Sufficiently good computer skills to do SQL queries and so on, but not a lot more than that.

Of course, “sufficiently good quantitative skills” can mean something quite different in data science than it does for the glorified arithmetic of ordinary business intelligence.

5. Cloudera Data Science Workbench doesn’t have any special magic in parallelization. It just helps you access the parallelism that’s already out there. Some algorithms are easy to parallelize. Some libraries have parallelized a few algorithms beyond that. Otherwise, you’re on your own.

6. When I asked whether Cloudera Data Science Workbench was open source (like most of what Cloudera provides) or closed source (like Cloudera Manager), I didn’t get the clearest of answers. On the one hand, it’s a Cloudera-specific product, as the name suggests; on the other, it’s positioned as having been stitched together almost entirely from a collection of open source projects.


Introduction to SequoiaDB and SequoiaCM

Sun, 2017-03-12 13:19

For starters, let me say:

  • SequoiaDB, the company, is my client.
  • SequoiaDB, the product, is the main product of SequoiaDB, the company.
  • SequoiaDB, the company, has another product line, SequoiaCM, which subsumes SequoiaDB in content management use cases.
  • SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
  • … and is usually sold for RDBMS-like use cases …
  • … except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
  • SequoiaDB’s products are open source.
  • SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
  • Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.

Also:

  • SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
  • Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
    • SequoiaDB’s founders were Chinese nationals.
    • Most of them went back to China.
    • Other employees to date have been entirely Chinese.
    • Sales to date have been entirely in China, but SequoiaDB has international aspirations.
  • SequoiaDB has >100 employees, a large majority of whom are split fairly evenly between “engineering” and “implementation and technical support”.
  • SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
  • SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)

Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.

While I usually think that the advantages of open source are overstated, in SequoiaDB’s case open source will have an additional benefit when SequoiaDB does go international — it addresses any concerns somebody might have about using Chinese technology.

SequoiaDB’s technology story starts:

  • SequoiaDB is a layered DBMS.
  • It manages JSON via update-in-place. MVCC (Multi-Version Concurrency Control) is on the roadmap.
  • Indexes are B-tree.
  • Transparent sharding and elasticity happen in what by now is the industry-standard/best-practices way:
    • There are many (typically 4096) logical partitions, many of which are assigned to each physical partition.
    • If the number of physical partitions changes, logical partitions are reassigned accordingly. (A minimal sketch of this scheme appears after this list.)
  • Relational OLTP (OnLine Transaction Processing) functionality is achieved by using a kind of PostgreSQL front end.
  • Relational batch processing is done via SparkSQL.
  • There also is a block/LOB (Large OBject) storage engine meant for content management applications.
  • SequoiaCM boils down technically to:
    • SequoiaDB, which is used to store JSON metadata about the LOBs …
    • … and whose generic-DBMS coordination capabilities are also used over the block/LOB engine.
    • A Java library focused on content management.
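
Here is the promised sketch of that logical/physical partitioning scheme. The 4096 figure comes from the description above; the hash function and the round-robin assignment are my own simplifications, not SequoiaDB specifics.

    import hashlib

    LOGICAL_PARTITIONS = 4096

    def logical_partition(key):
        # Hash the key into a stable logical partition id.
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % LOGICAL_PARTITIONS

    def assignment(physical_count):
        # Map each logical partition onto a physical partition (round-robin).
        return {lp: lp % physical_count for lp in range(LOGICAL_PARTITIONS)}

    # Elasticity: growing from 4 to 6 physical partitions just recomputes the
    # logical-to-physical map; a key's logical partition never changes.
    before, after = assignment(4), assignment(6)
    moved = sum(before[lp] != after[lp] for lp in range(LOGICAL_PARTITIONS))
    print(moved, "of", LOGICAL_PARTITIONS, "logical partitions move")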

SequoiaDB’s relationship with PostgreSQL is complicated, but as best I understand SequoiaDB’s relational operations:

  • SQL parsing, optimization, and so on rely mainly on PostgreSQL code. (Of course, there are some hacks, such as to the optimizer’s cost functions.)
  • Actual data storage is done via SequoiaDB’s JSON store, using PostgreSQL Foreign Data Wrappers. Each record goes in a separate JSON document. Locks, commits and so on — i.e. “write prevention” :) — are handled by the JSON store.
  • PostgreSQL’s own storage engine is actually part of the stack, but only to manage temp space and the like.

PostgreSQL stored procedures are already in the SequoiaDB product. Triggers and referential integrity are not. Neither, so far as I can tell, are PostgreSQL’s datatype extensibility capabilities.

I neglected to ask how much of that remains true when SparkSQL is invoked.

SequoiaDB’s use cases to date seem to fall mainly into three groups:

  • Content management via SequoiaCM.
  • “Operational data lakes”.
  • Pretty generic replacement of legacy RDBMS.

Internet back-ends, however — and this is somewhat counter-intuitive for an open-source JSON store — are rare, at least among paying subscription customers. But SequoiaDB did tell me of one classic IoT (Internet of Things) application, with lots of devices “phoning home” and the results immediately feeding a JSON-based dashboard.

To understand SequoiaDB’s “operational data lake” story, it helps to understand the typical state of data warehousing at SequoiaDB’s customers and prospects, which isn’t great:

  • 2-3 years of data, and not all the data even from that time period.
  • Only enough processing power to support structured business intelligence …
  • … and hence little opportunity for ad-hoc query.

SequoiaDB operational data lakes offer multiple improvements over that scenario:

  • They hold as much relational data as customers choose to dump there.
  • That data can be simply copied from operational stores, with no transformation.
  • Or if data arrives via JSON — from external organizations or micro-services as the case may be — the JSON can be stored unmodified as well.
  • Queries can be run straight against this data soup.
  • Of course, views can also be set up in advance to help with querying.

Views are particularly useful with what might be called slowly changing schemas. (I didn’t check whether what SequoiaDB is talking about matches precisely with the more common term “slowly changing dimensions”.) Each time the schema changes, a new table is created in SequoiaDB to receive copies of the data. If one wants to query against the parts of the database structure that didn’t change — well, a view can be established to allow for that.
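
A toy Python sketch of that approach, with “tables” represented as mere lists of dicts and all names illustrative:

    # Each schema version gets its own "table".
    v1_rows = [{"id": 1, "amount": 10.0}]                       # original schema
    v2_rows = [{"id": 2, "amount": 20.0, "channel": "mobile"}]  # after a change

    def stable_view(*tables):
        # Union of all tables, projected onto the columns they all share,
        # so queries against the unchanged structure keep working.
        shared = set.intersection(*(set(t[0]) for t in tables if t))
        return [{c: row[c] for c in shared} for t in tables for row in t]

    print(stable_view(v1_rows, v2_rows))
    # e.g. [{'id': 1, 'amount': 10.0}, {'id': 2, 'amount': 20.0}]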

Finally, it seems that SequoiaCM uses are concentrated in what might be called “security and checking-up” areas, such as:

  • Photographs as part of an authentication process.
  • Video of in-person banking transactions, both for fraud prevention and for general service quality assurance.
  • Storage of security videos (for example from automated teller machines).

SequoiaCM deals seem to be bigger than other SequoiaDB ones, surely in part because the amounts of data managed are larger.


One bit of news in Trump’s speech

Tue, 2017-02-28 23:26

Donald Trump addressed Congress tonight. As may be seen from the transcript, his speech — while uncharacteristically sober — was largely vacuous.

That said, while Steve Bannon is firmly established as Trump’s puppet master, they don’t agree on quite everything, and one of the documented disagreements had been in their view of skilled, entrepreneurial founder-type immigrants: Bannon opposes them, but Trump has disagreed with his view. And as per the speech, Trump seems to be maintaining his disagreement.

At least, that seems implied by his call for “a merit-based immigration system.”

And by the way — Trump managed to give a whole speech without saying anything overtly racist. Indeed, he specifically decried the murder of an Indian-immigrant engineer. By Trump standards, that counts as a kind of progress.


Coordination, the underused “C” word

Tue, 2017-02-28 22:34

I’d like to argue that a single frame can be used to view a lot of the issues that we think about. Specifically, I’m referring to coordination, which I think is a clearer way of characterizing much of what we commonly call communication or collaboration.

It’s easy to argue that computing, to an overwhelming extent, is really about communication. Most obviously:

  • Data is constantly moving around — across wide area networks, across local networks, within individual boxes, or even within particular chips.
  • Many major developments are almost purely about communication. The most important computing device today may be a telephone. The World Wide Web is essentially a publishing platform. Social media are huge. Etc.

Indeed, it’s reasonable to claim:

  • When technology creates new information, it’s either analytics or just raw measurement.
  • Everything else is just moving information around, and that’s communication.

A little less obvious is that much of this communication could alternatively be described as coordination. Some communication has pure consumer value, such as when we talk/email/Facebook/Snapchat/FaceTime with loved ones. But much of the rest is for the purpose of coordinating business or technical processes.

Among the technical categories that boil down to coordination are:

  • Operating systems.
  • Anything to do with distributed computing.
  • Anything to do with system or cluster management.
  • Anything that’s called “collaboration”.

That’s a lot of the value in “platform” IT right there. 

Meanwhile, in pre-internet apps:

  • Some of the early IT wins were in pure accounting and information management. But a lot of the rest were in various forms of coordination, such as logistics and inventory management.
  • The glory days of enterprise apps really started with SAP’s emphasis on “business processes”. (“Business process reengineering” was also a major buzzword back in the day.)

This also all fits with the “route” part of my claim that “historically, application software has existed mainly to record and route information.”

And in the internet era:

  • “Sharing economy” companies, led by Uber and Airbnb, have created a lot more shareholder value than the most successful pure IT startups of the era.
  • Amazon, in e-commerce and cloud computing alike, has run some of the biggest coordination projects of all.

This all ties into one of the key underlying subjects to modern politics and economics, namely the future of work.

  • Globalization is enabled by IT’s ability to coordinate far-flung enterprises.
  • Large enterprises need fewer full-time employees when individual or smaller-enterprise contractors are easier to coordinate. (It’s been 30 years since I drew a paycheck from a company I didn’t own.)
  • And of course, many white collar jobs are being entirely automated away, especially those that can be stereotyped as “paper shuffling”.

By now, I hope it’s clear that “coordination” covers a whole lot of IT. So why do I think using a term with such broad application adds any clarity? I’ve already given some examples above, in that:

  • “Coordination” seems clearer than “communication” when characterizing the essence of distributed computing.
  • “Coordination” seems clearer than “communication” if we’re discussing the functioning of large enterprises or of large-enterprise-substitutes.

Further — even when we focus on the analytic realm, the emphasis on “coordination” has value. A big part of analytic value comes in determining when to do something. Specifically that arises when:

  • Analytics identifies a problem that just occurred, or is about to happen, allowing a timely fix.
  • Business intelligence is used for monitoring, of impending problems or otherwise, as a guide to when action is needed.
  • Logistics of any kind get optimized.

I’d also say that most recommendation/personalization fits into the “coordination” area, but that’s a bit more of a stretch; you’re welcome to disagree.

I do not claim that analytics’ value can be wholly captured by the “coordination” theme. Decisions about whether to do something major — or about what to do — are typically made by small numbers of people; they turn into major coordination exercises only after a project gets its green light. But such cases, while important, are pretty rare. For the most part, analytic results serve as inputs to business processes. And business processes, on the whole, typically have a lot to do with coordination.

Bottom line: Most of what’s valuable in IT relates to communication or coordination. Apparent counterexamples should be viewed with caution.

