Re: Graph Schema

From: Sampo Syreeni <decoy_at_iki.fi>
Date: Tue, 14 Nov 2006 14:20:11 +0200
Message-ID: <Pine.SOL.4.62.0611131651380.28107_at_kruuna.helsinki.fi>


On 2006-11-13, barias_at_axiscode.com wrote:

> A) Is it true that a graph (as in graph theory, nodes and edges), when
> represented recursively in a schema, is considered a nemesis to DBAs
> and application developers? Or are such schemas considered "routine"
> for skilled DBAs and application developers?

As Bob said, recursive structures are easily represented. At least in theory they can be processed rather efficiently as well, and you cannot avoid them if that's the kind of structure your problem domain happens to possess. Current interfaces to off-the-shelf DBMSs do not always sit too well with complicated recursion (or schema cyclicity), because general transitive closure processing is not supported (though closure queries against lone self-referential tables are) and client-side traversal easily bumps into client-server latency and interface parallelism issues. In general I'd say working with recursive structures is still routine, if a bit tedious.
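
By "closure queries against lone self-referential tables" I mean roughly the following, sketched in SQL:1999 recursive syntax over an invented edge table (your DBMS may spell this CONNECT BY instead, or not support it at all):

    -- an edge list: each row is one edge of a directed graph (child part_of parent)
    CREATE TABLE part_of (
        child  INTEGER NOT NULL,
        parent INTEGER NOT NULL,
        PRIMARY KEY (child, parent)
    );

    -- all ancestors of node 42 (an arbitrary example id), in one server round trip
    WITH RECURSIVE ancestors(node) AS (
        SELECT parent FROM part_of WHERE child = 42
        UNION
        SELECT p.parent FROM part_of p JOIN ancestors a ON p.child = a.node
    )
    SELECT node FROM ancestors;

Do the same traversal edge by edge from the client instead and you pay one round trip per step, which is where the latency and parallelism problems above come from.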

> B) For a given application, suppose you had a choice between a schema
> that was domain centric, or a schema that took the various domain
> "entities" and abstracted them into a graph (see question #A).

This boils down to EAV (entity-attribute-value modelling), which is generally a bad idea. Since I argued a while back that EAV might have at least some valid uses, I think I should share the other half of my reasoning.

Above you mention abstraction. That's all fine and good if that's what it's really about and a higher level of abstraction fits the problem at hand. But there is a big difference between abstraction and what I've called reification in the past. Graph advocates often talk about the former when they actually mean the latter, and going with that will land you in a world of hurt.

Abstraction is about glossing over irrelevant detail to concentrate on overarching, unifying features of some set of phenomena. By definition it always comes at a price: you assume less to be able to generalize, but then you'll no longer be able to describe the minutiae. A typical example would be the generalization from persons and companies to customers, which might be of either kind: the generalization allows you to handle both corporate and individual customers on an equal footing, but then your unified codebase won't be able to answer some interesting questions which can only be meaningfully asked about individuals. At the very highest levels of abstraction you might talk about entities and relations, i.e. pretty much anything, but then the questions you can answer are already wholly uninteresting. It's easy to see that abstraction is sometimes useful and sometimes not; it's a data engineering tradeoff.
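
As a sketch of that customer example (all table and column names are invented for illustration):

    -- the abstraction: what is common to both kinds of customer
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        VARCHAR(200) NOT NULL
    );

    -- the detail the abstraction deliberately glosses over
    CREATE TABLE persons (
        customer_id   INTEGER PRIMARY KEY REFERENCES customers,
        date_of_birth DATE NOT NULL
    );

    CREATE TABLE companies (
        customer_id INTEGER PRIMARY KEY REFERENCES customers,
        vat_number  VARCHAR(20) NOT NULL
    );

Code written against customers alone can bill either kind of customer, but it can no longer ask who is of legal age; that question only exists one level of abstraction down.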

Generally this is not what graph data models do, however. Instead they try to cram every bit of detail there is to know about your domain into a representation that looks uniform. They do this by utilizing a graph metamodel, and describing how your data can be represented in it. For example, instead of saying "there is a person x with name y and date of birth z" they choose one of the many possible graph representations for the structure and describe that instead. None of the detail has been abstracted away, but instead the immediate object of the description has changed to something with a more uniform structure: we're no longer describing objects in our domain of discourse, but the things needed in our chosen metamodel to talk about those objects. The apparent simplicity comes from the artificial structure of the metamodel, not from the simplification of the problem domain that abstraction proper would bring about. In knowledge representation terms this is reification: making talk representable within the data model, so that talk about talk (metatalk) becomes possible.
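
To make the difference concrete, here is one of those possible graph representations, as a plain triple table (again only a sketch; the table and the fact are invented):

    -- the metamodel: everything is a (subject, property, value) triple
    CREATE TABLE triples (
        subject  VARCHAR(100) NOT NULL,
        property VARCHAR(100) NOT NULL,
        value    VARCHAR(100) NOT NULL
    );

    -- "there is a person x with name y and date of birth z"
    INSERT INTO triples VALUES ('x', 'type',          'person');
    INSERT INTO triples VALUES ('x', 'name',          'y');
    INSERT INTO triples VALUES ('x', 'date_of_birth', 'z');

No detail is gone; the DBMS merely no longer knows that a name is a string, that a date of birth is a date, or that a person should have exactly one of each.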

Such a mechanism is useful when you want to, e.g., talk about statements without asserting them. For instance, "I believe ((Simon says (there is a person x with name y and date of birth z)) is hogwash)". But from the ordinary database perspective, we don't really want to go there because here we're only discussing regular, factual knowledge, and reifying it would just cause us to be once removed from it. Instead of declaring the structure of the underlying data to the DBMS, we'd be declaring the structure of the metamodel and leaving the rest hanging in the air, because the DBMS obviously doesn't understand what is going on inside our overlay model. That then implies three things: much of the functionality provided by the DBMS becomes unavailable (EAV is often chosen precisely to circumvent typing and integrity constraints), we have to reimplement at least some of it in our application logic (EAV often brings performance degradation and high maintenance cost because of the huge self-joins needed to reconstruct the data), and at worst we'll jeopardize the semantics of our data model by mixing asserted and reified data together (in many graph models, RDF included, binary relations are represented natively while nonbinary ones cannot be, and the resulting tension is never adequately resolved).
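
The self-joins look like this when you want your persons back as ordinary rows, continuing the hypothetical triple table above:

    -- one join per attribute; twenty attributes means twenty joins,
    -- plus outer joins and COALESCE games for the optional ones
    SELECT t.subject,
           n.value AS name,
           d.value AS date_of_birth
    FROM triples t
    JOIN triples n ON n.subject = t.subject AND n.property = 'name'
    JOIN triples d ON d.subject = t.subject AND d.property = 'date_of_birth'
    WHERE t.property = 'type' AND t.value = 'person';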

> Suppose the "graph" schema was (1) half the size (2) more data-driven
> and (3) seemingly easier to do application development with.

In terms of the above, graph data models can be nice when you want to talk *about* data; that's what they're built for. But when you actually want to talk data, that is to *do* something with it, you'll want a data model, not a metamodel. As David Cressey pointed out, in the long term this usually becomes necessary even in applications which started out doing pure metalevel manipulation. When your data management needs are restricted to abstract stuff like "update property y of x to value z" and "dump everything you know about x", a generic graph model seems to be precisely what you need. But sooner or later somebody will want you to guarantee that all of your individual customers are of legal age or to do some real-life customer segmentation on top of your dataset. Then suddenly you're going to have to write pages upon pages of code to do the assertion, validation, statistical processing and what not that would otherwise have been handled with a few lines of SQL.
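
For comparison, over the hypothetical persons table from earlier those requests really are a few lines of SQL (assuming a legal age of 18); over the triple table each of them first needs the reconstruction join above, and nothing in that schema even guarantees the date_of_birth values parse as dates:

    -- find individual customers below legal age
    SELECT customer_id
    FROM persons
    WHERE date_of_birth > CURRENT_DATE - INTERVAL '18' YEAR;

    -- a crude age-based segmentation
    SELECT CASE WHEN date_of_birth > CURRENT_DATE - INTERVAL '40' YEAR
                THEN 'under 40' ELSE '40 and over' END AS segment,
           COUNT(*) AS customers
    FROM persons
    GROUP BY CASE WHEN date_of_birth > CURRENT_DATE - INTERVAL '40' YEAR
                  THEN 'under 40' ELSE '40 and over' END;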

> Or is it a well understood database principle and practice to avoid
> such "temptations" in favor of the domain centric schema? Why?

It is, and the reason is that what you normally want is maximum expressivity in the schema, with cheap, shrink-wrapped functionality building on top of it. When you need higher levels of abstraction on top of that, for the most part you can achieve that with inheritance modelling, judicious use of views, and middleware/glue, quite without sacrificing the expressivity of your base data model. The only time you'll actually want to go with a graph/EAV/semistructured base model is when a) you're more or less guaranteed to remain at the metalevel for the life of the application, as in certain kinds of data-agnostic middleware, b) your application framework/APIs/DBMS do not support your abstraction and processing requirements well enough for you to work at multiple levels of abstraction simultaneously, and/or c) you're willing to spend a lot of effort reimplementing functionality that you'd take for granted under real DBMSs and schemas.
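
What I mean by getting the abstraction from views instead of from the base model, still with the invented customer tables:

    -- the uniform, abstract face of the data, defined on top of the
    -- fully declared base tables instead of replacing them
    CREATE VIEW customer_overview AS
        SELECT c.customer_id, c.name, 'person' AS kind
        FROM customers c JOIN persons p ON p.customer_id = c.customer_id
        UNION ALL
        SELECT c.customer_id, c.name, 'company' AS kind
        FROM customers c JOIN companies co ON co.customer_id = c.customer_id;

The generic code talks to the view, while the types, the constraints and the interesting person-only questions stay with the base tables.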

Earlier, when I stated that there could be applications for EAV where you don't yet know your data well enough, but also that the data should be moved to a proper schema before actually being used, this is what I was talking about. As long as you don't understand the data, it makes sense to talk about it qua abstract data, without asserting any of it. After all, if you don't understand it, you won't even be able to guarantee its integrity. At the same time metalevel manipulations like comparisons across to-be relations and attributes are important so that you can mine and start understanding the data. Hiding the data inside a high-level metamodel that is not understood by the DBMS, that doesn't have any integrity constraints, that treats types and instances on an equal footing, that makes metalevel operations like quantification over relations easy, and so on, seems natural to me. But for the same reasons it also isn't what you'll want in a production environment.

Finally, sometimes EAV is suggested because we want some of the metalevel functionality, and it might not be directly or naturally enough supported by our DBMS. Quantification over relations and attributes is a case in point -- it is a structure from second-order logic, and thus foreign to relational algebra. If you have to implement something like that, it can be so much easier to do over EAV than over catalog data (which is meant for the same purpose, but is not standardized and fails to be as easy to use) that there is now some pressure to abandon proper schema design.
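
For instance, "which relations carry a date_of_birth attribute" is that kind of metalevel question. Sketched below, first against the SQL standard's INFORMATION_SCHEMA views (which many products don't even ship, substituting proprietary catalogs), then against the invented triple table from earlier:

    -- over the catalog: quantifying over relations and attributes
    SELECT table_name
    FROM information_schema.columns
    WHERE column_name = 'date_of_birth';

    -- over EAV the same metalevel question is just another data query,
    -- which is exactly why EAV is so tempting here
    SELECT DISTINCT subject
    FROM triples
    WHERE property = 'date_of_birth';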

But then the problem is in the DBMS or in our middleware, neither of which allows you to work at both levels of abstraction (here, relations qua propositional statements, and relations qua objects within the universe of discourse) at the same time. The way I see it, the most productive way to correct that problem is to leverage as much of the existing relational machinery as possible, and to fill in the rest. Not to ditch the relational model and cook up our own from scratch. That's also more or less what is done in the better EAV middleware like Nadkarni's EAV/CR, so it's that sort of stuff that I'd like to see developed. Not naïve EAV elevated into a principle.

-- 
Sampo Syreeni, aka decoy - mailto:decoy_at_iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
