Re: Resiliency To New Data Requirements

From: Marshall <marshall.spight_at_gmail.com>
Date: 16 Aug 2006 21:40:14 -0700
Message-ID: <1155789614.688385.250150_at_h48g2000cwc.googlegroups.com>


Keith H Duggar wrote:
>
> It probably is off-topic, however, for general usage I'm
> forced to partly agree with Dawn that the various *ML's do
> have semi-structure.

Well, I would draw a distinction between "general usage" and technical usage.

And this does all circle back to my recurring point about definitions: anyone can make one; they are intrinsically neither right nor wrong. However, some definitions are more formally made and sometimes more authoritatively endorsed. If we agree that some term has some definition, we can usefully use that to communicate.

In the context of data management (which is in fact our current context in this NG) I expect the term "structured data" to mean something more specific than I might if I used the term with Aunt Mildred.

I think an important matter that is left out of consideration of the "structure" related terms is perspective. If I have a schema for some structured data, and I don't tell you what it is, but instead send you that data as an undifferentiated byte buffer, because you run an encryption service and I want the buffer encrypted, that is structured data for me but unstructured data for you.

(Ben Kenobi: "So the data I sent you *was* structured ... from a certain point of view.")
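
To make the point-of-view argument concrete, here is a minimal sketch in Python. The record layout, the field names, and the two "view" functions are all made up for illustration; the only claim is that the same bytes carry fields for the party holding the schema and nothing but a length for the party that does not.

```python
import struct

# Hypothetical schema, known only to the sender:
# little-endian uint32 id, 8-byte name, float64 balance.
SENDER_SCHEMA = struct.Struct("<I8sd")

def sender_view(buf: bytes) -> dict:
    """The sender holds the schema, so the buffer decodes into named fields."""
    ident, name, balance = SENDER_SCHEMA.unpack(buf)
    return {
        "id": ident,
        "name": name.rstrip(b"\x00").decode("ascii"),
        "balance": balance,
    }

def service_view(buf: bytes) -> dict:
    """The encryption service has no schema; all it can report is a length."""
    return {"length": len(buf)}

# The name is shorter than 8 bytes, so struct pads it with NUL bytes.
buf = SENDER_SCHEMA.pack(42, b"mildred", 17.5)
```

Nothing about `buf` changes between the two calls; only the knowledge of the party reading it does.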

The problem I have with terms like "unstructured" and especially "semi-structured" is that I don't think there is any agreement on what they mean. Where I work is filled with quite smart, well-educated people. I don't work at Initech by any means. And on occasion, I have heard people (again, *smart* people) use the term "semi-structured" and later asked them what *exactly* that means. By and large they hem and haw and make vague attempts to define it, but in the end I'm convinced they are using the term in an evocative rather than technical way. It means "kinda structured." Or even "badly structured." I don't think this definition of "semi-structured" has much to teach us.

Now, there are other definitions. The best one I've heard I got from Jan Hidders, although I don't remember whether it was him defining it or it was defined in a paper he referenced. Referring back to the point-of-view issue, we can define semi-structured data as data for which we know only part of the schema. And in fact there are some very interesting use cases there.

In general, in a technical context, especially a data-management one, when I use the term "structured data" I mean data for which I have the schema. "Unstructured data" is data for which I do not have the schema. And "semi-structured data" is data for which I have some of the schema.
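
These three definitions can be sketched as a small Python function. The function name, the dict-based record, and the idea of representing "the schema I have" as a mapping from field name to type are assumptions for illustration, not anything from the thread.

```python
def classify(record: dict, known_schema: dict) -> str:
    """Classify a record by how much of its schema the reader holds.

    known_schema maps field name -> expected type, for the fields
    the reader knows about (a hypothetical representation).
    """
    known = [field for field in record if field in known_schema]
    if not known:
        return "unstructured"      # no schema in hand for any field
    if len(known) == len(record):
        return "structured"        # schema covers every field
    return "semi-structured"       # schema covers only some fields

record = {"id": 7, "name": "Mildred", "payload": b"\x01\x02"}
```

The same `record` lands in all three categories depending only on which `known_schema` is passed in, which is the point: the label describes the reader's knowledge, not the data.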

I don't particularly buy into the idea of data for which there is intrinsically no structure, outside of something that is just noise, such as a measurement of radioactive decay. (Although I intend to steer well away from any field I am completely ignorant about, such as, oh I don't know, chemical engineering say, lest I embarrass myself.)

Another factor to consider is schema reconstruction. There may exist structured data that we have in hand, but we may not have the schema itself. But by looking at the data, we may be able to reconstruct what the schema is. In fact, it is likely that that reconstruction process will be only partially successful, resulting in (wait for it) semi-structured data.
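
A minimal sketch of schema reconstruction in Python, assuming the data arrives as a list of dicts. The function name and the rule used here (a field is "recovered" only if every observed value has the same type) are simplifying assumptions; real schema inference is a good deal more involved.

```python
def reconstruct_schema(records: list) -> dict:
    """Guess a schema from the data alone.

    Returns field name -> type where every observed value agrees,
    and field name -> None where the type could not be pinned down.
    """
    seen = {}
    for rec in records:
        for field, value in rec.items():
            seen.setdefault(field, set()).add(type(value))
    return {
        field: (next(iter(types)) if len(types) == 1 else None)
        for field, types in seen.items()
    }

records = [
    {"id": 1, "name": "a", "extra": 3},
    {"id": 2, "name": "b", "extra": "x"},
]
schema = reconstruct_schema(records)
```

Here `id` and `name` are recovered but `extra` is not, so the reconstruction is only partial and what we end up holding is some of the schema.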

What about plain English text? How "structured" is the Declaration of Independence, A Shropshire Lad, the Wikipedia entry for Obi-wan Kenobi, or this post? I believe, although I cannot demonstrate, that human thought has a schema. It is of course a hugely complex schema, vastly more so than any schema of deliberate human invention. And in fact it is likely a different schema for each person, but with many commonalities and cultural trends. But whether I am right about this or not is irrelevant for now as the issue is AI-complete.

Which returns us to the context issue. When we are talking about structured data management, the relevant processing entity is the computer. And the computer isn't going to be able to do much without an *explicit* schema.

For the computer, the issue is simple enough to be reduced to a bumper sticker:

  No schema, no semantics.
  Know schema, know semantics.

> I think almost anyone would understand
> what you meant to communicate if you said something like
> "plain text is unstructured, relational data is structured,
> and the stuff in between like HTML is semi-structured". Sure
> the semi-structure sucks in major ways but the word semi-
> structured communicates the concept just fine.

Well, what *is* the structure, or semi-structure, that HTML has that plain text doesn't? I don't see that it's really anything more than "put this word in bold." Okay, there are also anchor tags, and you have an href and the link text. But this structure is one for which the schema is already fixed; it is the HTML grammar. SQL in contrast provides a meta-model: a model with which one generates models. HTML does not provide a meta-model, just a model. Any data management solution, or programming language for that matter, requires a meta-model, not just a model.
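
As a rough illustration of "a model with which one generates models", here is a sketch using Python's standard sqlite3 module. The `poem` table and its columns are invented for the example; the point is only that SQL lets the user declare a new schema at run time, which the engine then records and enforces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The user, not the language, decides what relations exist.
conn.execute(
    "CREATE TABLE poem (title TEXT PRIMARY KEY, author TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO poem VALUES (?, ?)",
    ("A Shropshire Lad", "A. E. Housman"),
)

# The schema itself is data the system can report back
# (column name is the second field of each table_info row).
columns = [row[1] for row in conn.execute("PRAGMA table_info(poem)")]
rows = conn.execute("SELECT title, author FROM poem").fetchall()
```

HTML has no counterpart to the `CREATE TABLE` step: the set of tags and what they mean is fixed by the grammar before any document is written.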

Marshall

http://en.wikipedia.org/wiki/Obiwan_Kenobi
http://en.wikipedia.org/wiki/A_Shropshire_Lad

Received on Thu Aug 17 2006 - 06:40:14 CEST