Re: Resiliency To New Data Requirements
Date: 17 Aug 2006 05:51:06 -0700
Message-ID: <1155819065.942347.257660_at_m73g2000cwd.googlegroups.com>
Marshall wrote:
> I think an important matter that is left out of consideration
> of the "structure" related terms is perspective. If I have a
> schema for some structured data, and I don't tell you
> what it is, but instead send you that data as an
> undifferentiated byte buffer, because you run an
> encryption service and I want the buffer encrypted,
> that is structured data for me but unstructured data for you.
So the issue is information hiding, of a sort. I'm sending you data you don't care about. While you can see it, you can remain ignorant of it. In the relational world, the analogue would be ignoring one or more attributes in a relation - a projection, and probably a view.
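To make the projection/view idea concrete, here is a minimal sketch using Python's sqlite3 module. The table and view names (message, message_envelope) and the sample row are my own invention for illustration, not anything from the thread.

```python
import sqlite3

# Hypothetical table: the payload column plays the role of the opaque byte buffer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message (id INTEGER PRIMARY KEY, sender TEXT, payload BLOB)"
)
conn.execute(
    "INSERT INTO message (sender, payload) VALUES (?, ?)",
    ("marshall", b"\x00\x01\x02"),
)

# The view is a projection: it drops the attribute this consumer does not care about.
conn.execute("CREATE VIEW message_envelope AS SELECT id, sender FROM message")

cursor = conn.execute("SELECT * FROM message_envelope")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
```

A consumer that queries only message_envelope never has to know that payload exists, let alone what is inside it.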
> In general, in a technical context, especially a data-management
> one, when I use the term "structured data" I mean data
> for which I have the schema. "Unstructured data" is data
> for which I do not have the schema. And "semi-structured
> data" is data for which I have some of the schema.
I would argue that "unstructured data" refers to a value for which there is no schema: a value of some type or domain, in other words. Such values will probably have a render-as-string method to display a representation, and other functions to manipulate them, but there is no "structure."
"Some of the schema" is an interesting definition... but is this anything different from regular expressions applied to a string value? Perhaps it also means that a substring (extracted via a regular expression) can also be relied on to be the representation of some non-string value (e.g. <crap>3</crap> where crap is defined as xs:int).
So how does this apply to relations? Can views accomplish the same thing, or the useful subset thereof? If it's common in XML to extend a schema with an "xs:any" element and redefine that element with detailed types, I haven't seen it. I'm not sure why this would be useful.
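As a rough illustration of the regular-expression reading of "some of the schema" mentioned above: the only part of the string we claim to know is the one element, and its content is converted to a non-string value. The pattern and the function name extract_ints are assumptions for this sketch; it checks only for an optionally signed run of digits and does not enforce the full xs:int lexical rules or its 32-bit range.

```python
import re

# The one fragment of schema we have: a <crap> element holding an integer.
# Assumption: optional minus sign followed by digits; xs:int range is not checked.
ELEMENT = re.compile(r"<crap>(-?\d+)</crap>")


def extract_ints(text: str) -> list[int]:
    """Return the integer values of every <crap> element found in text."""
    return [int(match) for match in ELEMENT.findall(text)]


document = "free text we know nothing about <crap>3</crap> more free text"
values = extract_ints(document)
```

Everything outside the matched element stays an uninterpreted string, which is about as "semi-structured" as this gets.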
> I don't particularly buy in to the idea of data for which
> there is intrinsically no structure, outside of something
> that is just noise, such as a measurement of radioactive
> decay. (Although I intend to steer well away from any field
> I am completely ignorant about, such as, oh I don't know,
> chemical engineering say, lest I embarrass myself.)
Types/domains have no structure - or if they do, it's an implementation detail that only the definer knows. These are values, and not structures, and they can only be "manipulated" (not the right word) using functions defined over them.
> Another factor to consider is schema reconstruction.
> There may exist structured data that we have in hand,
> but we may not have the schema itself. But by looking
> at the data, we may be able to reconstruct what the
> schema is. In fact, it is likely that that reconstruction
> process will be only partially successful, resulting in
> (wait for it) semi-structured data.
You could do a similar thing for query output, but I'm not sure this "use case" is a good one for defining a data model (not that you were implying that it was).
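For what it's worth, the reconstruction Marshall describes can be sketched in a few lines. This is a toy under stated assumptions: the records are plain Python dicts, "type" just means the Python type name, and infer_schema and the sample data are made up for illustration.

```python
def infer_schema(records):
    """Map each field name to the set of type names observed for it."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Hypothetical data whose schema we were never given.
records = [
    {"id": 1, "name": "a", "extra": "x"},
    {"id": 2, "name": "b", "extra": 7},
]

schema = infer_schema(records)

# Fields with exactly one observed type count as reconstructed;
# the rest remain ambiguous, i.e. the "partially successful" part.
recovered = {f: next(iter(t)) for f, t in schema.items() if len(t) == 1}
ambiguous = sorted(f for f, t in schema.items() if len(t) > 1)
```

Here id and name come back with a single type each, while extra is left unresolved.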
> What about plain English text? How "structured" is
> the Declaration of Independence, A Shropshire Lad,
> the wikipedia entry for Obi-wan Kenobi, or this post?
> I believe, although I cannot demonstrate, that human
> thought has a schema.
Some postings on this newsgroup demonstrate otherwise. :-)
Certainly human language as used in practice has no schema (other than xs:any), or many wonderful books (e.g. Ulysses) could not have been written.
> Which returns us to the context issue. When we are
> talking about structured data management, the relevant
> processing entity is the computer. And the computer
> isn't going to be able to do much without an *explicit*
> schema.
The question about "data" for which we have only some of the schema seems to hinge on data transmission. If we're talking about data management, rather than exchange, then the schema won't be variable. At least I can't think of such a case.
For messages sent to multiple destinations, it might make sense for one recipient to ignore a large chunk of the structure, and another to decompose it. I'm just not sure how things would come to that; perhaps if the first recipient were unwilling to update their schema to accommodate the "new" data?
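A small sketch of that multiple-destination case, assuming a JSON message whose details field is itself a serialized document. The field names and both recipient functions are hypothetical.

```python
import json

# One message, sent unchanged to both recipients.
message = json.dumps(
    {"order_id": 42, "details": json.dumps({"sku": "A-1", "qty": 3})}
)


def recipient_a(raw):
    """Knows only the envelope; treats details as an opaque string."""
    envelope = json.loads(raw)
    return envelope["order_id"], envelope["details"]


def recipient_b(raw):
    """Has the fuller schema and decomposes details as well."""
    envelope = json.loads(raw)
    details = json.loads(envelope["details"])
    return envelope["order_id"], details["sku"], details["qty"]


seen_by_a = recipient_a(message)
seen_by_b = recipient_b(message)
```

The bytes on the wire are identical; the only difference is how much of the schema each side has chosen to adopt.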
- erk