Re: Surrogate Keys: an Implementation Issue

From: JOG <jog_at_cs.nott.ac.uk>
Date: 1 Aug 2006 05:28:14 -0700
Message-ID: <1154435294.728381.182060_at_h48g2000cwc.googlegroups.com>


Brian Selzer wrote:
> "JOG" <jog_at_cs.nott.ac.uk> wrote in message
> news:1154375986.924741.168930_at_p79g2000cwp.googlegroups.com...
> > Brian Selzer wrote:
> >> I prefer to think of a surrogate as more of a temporal attribute. While
> >> its
> >> value identifies a proposition in the database, it also indicates that a
> >> new
> >> proposition became known by the database, allowing it to become part of
> >> the
> >> discussion.
> >
> > A logfile will indicate when a relvar has a different relation value in
> > it. One doesn't need a surrogate for that.
> >
> >> What do I mean by that? Surrogates make it possible to
> >> determine whether totally new information is being added to the body of
> >> knowledge or that existing information has had to be altered to reflect
> >> changes in the circumstances underpinning that information.
> >
> > Totally new information? All updates add totally new information. It's
> > a totally new relation each time you make an update. You are confusing
> > RT with the entity/relationship model. The RT is not concerned with
> > entities but statements of fact. I recommend having another look at the
> > ORM as it helped me shift out of that mindset.
> >
>
> Yes, totally new. I perceive a significant difference between a proposition
> that reflects new circumstances and a proposition that reflects changes in
> existing circumstances. For example, if a new employee, John Smith, is
> hired, then information about him is totally new and now an additional tuple
> must exist to reflect that information; on the other hand, if employee Joyce
> Miller gets married and becomes Joyce Harper, then information about her
> isn't totally new--much of it was already known by the database prior to the
> update, and now a replacement tuple must exist to reflect that information.
> In practice, it is often necessary to define extra constraints on added
> information.

Well I doubt we're going to agree then Brian, because according to relational theory all updates/inserts are equally new information. We are replacing one relvar value with another and there is no "link" between the two and this is a fundamental part of the underlying mathematics.

> For example, several tasks may need to be performed by
> Personnel for a new employee, requiring that the several attributes in the
> Employee relation that denote whether or not those tasks have been completed
> be set a certain way for a new employee. If all updates add totally new
> information, then RT is not sufficient to meet that need, and we must rely
> on implementation-specific extensions to define such constraints.
>
> >> The Relational Model is insufficient in that regard because it does not
> >> require that
> >> primary keys be immutable.
> >>
> >> I think that surrogates should be hidden from end-users, but that is more
> >> of
> >> a best practice than a requirement and should be imposed upon and
> >> enforced
> >> by applications and database administrators. I disagree with the notion
> >> that surrogates should "leak" into the real world. They should be
> >> accessible to applications in order to prevent the update problems I've
> >> described in an earlier post, but their values should never appear on a
> >> data
> >> entry screen or document. There are many reasons for this, here are a
> >> couple:
> >>
> >> (1) Anything that an end-user can see is subject to a change order. For
> >> example, a customer may object to being referred to by the number "666."
> >> (2) It may be necessary to take a database offline and change the
> >> physical
> >> representation or values for surrogates, such as is the case when
> >> databases
> >> are merged. If a user has written down the key value, he'll complain
> >> that
> >> he can't find the information anymore.
> >
> > If you hide it, it is a row address. A pointer, if one at the logical
> > level.
> >
>
> No, it's not.

Hey, have the courtesy to expand glib responses. I've explained why imo a hidden surrogate is no more than an artificial entity ID - and essentially what one is doing with hidden surrogates is no different to utilising CODASYL record ID's. (and we all know about that right?). Where's the theory behind them? Convince me via the math.

>
> >>
> >> There should be a second candidate key on every relation. The difference
> >> between OIDs and surrogates is that OIDs are pointers that reference an
> >> instance of an object while a surrogate indicates that a statement of
> >> fact
> >> became known.
> >
> > OID's reference an instance of a class, not an object,
>
> True.
>
> >... but you are
> > correct that they are pointers. Yet if you hide a surrogate, then there
> > is little difference, it becomes a pointer to a proposition - the field
> > itself is no part of that real world proposition, it merely addresses
> > it. Again I honestly believe it is thinking in terms of entities that
> > generates the confusion.
> >
>
> Again, the database shouldn't hide them, the applications and DBA should.
>
> There is a significant difference. Yes, a surrogate identifies a
> proposition, just like any other candidate key, so in one sense it "points"
> to a proposition in the same sense that any candidate key "points" to a
> proposition, but the difference manifests itself at the theoretical level,
> rather than the implementational level. At the theoretical level,
> surrogates enable the propositions in one database state to be correlated
> with the propositions in the next, making it possible to differentiate
> between added, replaced and retired information during an update, permitting
> targeted constraints to be defined.

Look, if one is using a surrogate in this way one is saying that no candidate keys are sufficient to identify a proposition for one reason or another. You are saying that somehow, in the face of the accepted fundamentals of liebniz equality, there is a situation when every single one of an item's attributes have changed and yet it is still somehow the same thing. You're certainly going to have to provide more justification for a kludge of this nature and its huge philosophical break with accepted logic. But hey, I'm all ears if you can do that formally.

>
> >> This means that in an object store there can be object
> >> instances that are identical with exception of the object identifier.
> >
> > And with a hidden surrogate one can record identical real world
> > propositions, with the exception of that row identifier. Its the same
> > mistake imo.
> >
>
> No, one can't. A surrogate is not a surrogate without a second candidate
> key; therefore, it is not possible to record identical real world
> propositions with a surrogate.
>
> >> While it may be interesting to be able to count the number of identical
> >> object
> >> instances, the same proposition can't have become known more than once
> >> within the same database state. Relations are sets of propositions, so
> >> it
> >> doesn't make any sense to allow duplicates, since something is either
> >> known
> >> to be true or not.
> >
> > Neither does it make sense to permit duplicate statements of fact, by
> > using a hidden surrogate.
> >
>
> A surrogate does not allow duplicate statements of fact. See my previous
> comment.
>
> >> Therefore, I disagree with the idea of using a surrogate
> >> without a second candidate key, even if that key includes every other
> >> attribute in the relation. A surrogate is not a surrogate without a
> >> second
> >> candidate key.
> >>
> >> "JOG" <jog_at_cs.nott.ac.uk> wrote in message
> >> news:1154262656.521112.118530_at_b28g2000cwb.googlegroups.com...
> >> > This discussion has illuminated me, and I would like to share that. A
> >> > row is of course merely a proposition, and is identified by the
> >> > predicate fields that make each proposition in that set unique. A
> >> > surrogate key similar helps us identify a particular row, so there
> >> > initially seems no problem there, and its not an addressing issue.
> >> > Rather the issue is that the surrogate _does not exist_ in the real
> >> > world, it is initially an artifice of the database, and is not part of
> >> > the original proposition.
> >> >
> >> > So whats it doing there? Consider two people with the same names:
> >> >
> >> > Ex ( Name(x, "Sarah Jones") & Marital_status(x, married) ) &
> >> > Ey ( Name(y, "Sarah Jones") & Marital_status(y, married) ) &
> >> > x != y
> >> >
> >> > We don't have sufficient information to distinguish x and y (note this
> >> > specifically refers to the propositions, not the conceptual entities
> >> > involved). Perhaps this distinguishing information is impossible to
> >> > record, or simply cost ineffective to do so. Anyhow, If we had a
> >> > mechanism that seperated propositions x and y (an OID for example),
> >> > then how on earth would we know which one to update when the time came?
> >> > (Obviously we couldn't, and thats part of the problem with OID's.)
> >> >
> >> > But are surrogates in RM any different? My initial opinion was that the
> >> > answer was NO, that they seem to be no different to OID's, except in
> >> > their encoding. However gradually my opinion has changed. The reason
> >> > for this is that while an OID is always physical, a surrogate key leaks
> >> > into the real world almost as soon as it is generated. While it is
> >> > initially an artifice, it quickly becomes a real attribute, and that
> >> > validates it's use as part of the logical model. Hence the rows become:
> >> >
> >> > Ex ( EmpID(x, 1) & Name(x, "Sarah Jones") & status(x, married) ) &
> >> > Ey ( EmpID(x, 2) & Name(y, "Sarah Jones") & status(y, married) )
> >> >
> >> > Noone can tell you what their OID in some database is, but they may
> >> > well be able to tell you what their employee/payroll number is.
> >> >
> >> > It hence seems to me that it is the leak in to the real world that is
> >> > absolutely _vital_ to validate surrogate use.
> >> >
> >

In the end, maybe there is no way we will agree, as you view the world in somehow definably permanent entities with some divine temporal consistency on their identity. This is a very common and understsandable view, but it is an artifice and imo unhelpful to the progress of accurate information recording. Received on Tue Aug 01 2006 - 14:28:14 CEST

Original text of this message