Re: Surrogate Keys: an Implementation Issue

From: Brian Selzer <brian_at_selzer-software.com>
Date: Tue, 01 Aug 2006 02:03:49 GMT
Message-ID: <9Eyzg.119$9T3.98_at_newssvr25.news.prodigy.net>

"JOG" <jog_at_cs.nott.ac.uk> wrote in message news:1154375986.924741.168930_at_p79g2000cwp.googlegroups.com...
> Brian Selzer wrote:
>> I prefer to think of a surrogate as more of a temporal attribute. While
>> its
>> value identifies a proposition in the database, it also indicates that a
>> new
>> proposition became known by the database, allowing it to become part of
>> the
>> discussion.
>
> A logfile will indicate when a relvar has a different relation value in
> it. One doesn't need a surrogate for that.
>
>> What do I mean by that? Surrogates make it possible to
>> determine whether totally new information is being added to the body of
>> knowledge or that existing information has had to be altered to reflect
>> changes in the circumstances underpinning that information.
>
> Totally new information? All updates add totally new information. It's
> a totally new relation each time you make an update. You are confusing
> RT with the entity/relationship model. The RT is not concerned with
> entities but statements of fact. I recommend having another look at the
> ORM as it helped me shift out of that mindset.
>

Yes, totally new. I perceive a significant difference between a proposition that reflects new circumstances and a proposition that reflects changes in existing circumstances. For example, if a new employee, John Smith, is hired, then information about him is totally new and now an additional tuple must exist to reflect that information; on the other hand, if employee Joyce Miller gets married and becomes Joyce Harper, then information about her isn't totally new--much of it was already known by the database prior to the update, and now a replacement tuple must exist to reflect that information. In practice, it is often necessary to define extra constraints on added information. For example, several tasks may need to be performed by Personnel for a new employee, requiring that the several attributes in the Employee relation that denote whether or not those tasks have been completed be set a certain way for a new employee. If all updates add totally new information, then RT is not sufficient to meet that need, and we must rely on implementation-specific extensions to define such constraints.

>> The Relational Model is insufficient in that regard because it does not
>> require that
>> primary keys be immutable.
>>
>> I think that surrogates should be hidden from end-users, but that is more
>> of
>> a best practice than a requirement and should be imposed upon and
>> enforced
>> by applications and database administrators. I disagree with the notion
>> that surrogates should "leak" into the real world. They should be
>> accessible to applications in order to prevent the update problems I've
>> described in an earlier post, but their values should never appear on a
>> data
>> entry screen or document. There are many reasons for this, here are a
>> couple:
>>
>> (1) Anything that an end-user can see is subject to a change order. For
>> example, a customer may object to being referred to by the number "666."
>> (2) It may be necessary to take a database offline and change the
>> physical
>> representation or values for surrogates, such as is the case when
>> databases
>> are merged. If a user has written down the key value, he'll complain
>> that
>> he can't find the information anymore.
>
> If you hide it, it is a row address. A pointer, if one at the logical
> level.
>

No, it's not.

>>
>> There should be a second candidate key on every relation. The difference
>> between OIDs and surrogates is that OIDs are pointers that reference an
>> instance of an object while a surrogate indicates that a statement of
>> fact
>> became known.
>
> OID's reference an instance of a class, not an object,

True.

>... but you are
> correct that they are pointers. Yet if you hide a surrogate, then there
> is little difference, it becomes a pointer to a proposition - the field
> itself is no part of that real world proposition, it merely addresses
> it. Again I honestly believe it is thinking in terms of entities that
> generates the confusion.
>

Again, the database shouldn't hide them, the applications and DBA should.

There is a significant difference. Yes, a surrogate identifies a proposition, just like any other candidate key, so in one sense it "points" to a proposition in the same sense that any candidate key "points" to a proposition, but the difference manifests itself at the theoretical level, rather than the implementational level. At the theoretical level, surrogates enable the propositions in one database state to be correlated with the propositions in the next, making it possible to differentiate between added, replaced and retired information during an update, permitting targeted constraints to be defined.

>> This means that in an object store there can be object
>> instances that are identical with exception of the object identifier.
>
> And with a hidden surrogate one can record identical real world
> propositions, with the exception of that row identifier. Its the same
> mistake imo.
>

No, one can't. A surrogate is not a surrogate without a second candidate key; therefore, it is not possible to record identical real world propositions with a surrogate.

>> While it may be interesting to be able to count the number of identical
>> object
>> instances, the same proposition can't have become known more than once
>> within the same database state. Relations are sets of propositions, so
>> it
>> doesn't make any sense to allow duplicates, since something is either
>> known
>> to be true or not.
>
> Neither does it make sense to permit duplicate statements of fact, by
> using a hidden surrogate.
>

A surrogate does not allow duplicate statements of fact. See my previous comment.

>> Therefore, I disagree with the idea of using a surrogate
>> without a second candidate key, even if that key includes every other
>> attribute in the relation. A surrogate is not a surrogate without a
>> second
>> candidate key.
>>
>> "JOG" <jog_at_cs.nott.ac.uk> wrote in message
>> news:1154262656.521112.118530_at_b28g2000cwb.googlegroups.com...
>> > This discussion has illuminated me, and I would like to share that. A
>> > row is of course merely a proposition, and is identified by the
>> > predicate fields that make each proposition in that set unique. A
>> > surrogate key similar helps us identify a particular row, so there
>> > initially seems no problem there, and its not an addressing issue.
>> > Rather the issue is that the surrogate _does not exist_ in the real
>> > world, it is initially an artifice of the database, and is not part of
>> > the original proposition.
>> >
>> > So whats it doing there? Consider two people with the same names:
>> >
>> > Ex ( Name(x, "Sarah Jones") & Marital_status(x, married) ) &
>> > Ey ( Name(y, "Sarah Jones") & Marital_status(y, married) ) &
>> > x != y
>> >
>> > We don't have sufficient information to distinguish x and y (note this
>> > specifically refers to the propositions, not the conceptual entities
>> > involved). Perhaps this distinguishing information is impossible to
>> > record, or simply cost ineffective to do so. Anyhow, If we had a
>> > mechanism that seperated propositions x and y (an OID for example),
>> > then how on earth would we know which one to update when the time came?
>> > (Obviously we couldn't, and thats part of the problem with OID's.)
>> >
>> > But are surrogates in RM any different? My initial opinion was that the
>> > answer was NO, that they seem to be no different to OID's, except in
>> > their encoding. However gradually my opinion has changed. The reason
>> > for this is that while an OID is always physical, a surrogate key leaks
>> > into the real world almost as soon as it is generated. While it is
>> > initially an artifice, it quickly becomes a real attribute, and that
>> > validates it's use as part of the logical model. Hence the rows become:
>> >
>> > Ex ( EmpID(x, 1) & Name(x, "Sarah Jones") & status(x, married) ) &
>> > Ey ( EmpID(x, 2) & Name(y, "Sarah Jones") & status(y, married) )
>> >
>> > Noone can tell you what their OID in some database is, but they may
>> > well be able to tell you what their employee/payroll number is.
>> >
>> > It hence seems to me that it is the leak in to the real world that is
>> > absolutely _vital_ to validate surrogate use.
>> >
>
Received on Tue Aug 01 2006 - 04:03:49 CEST

This message: [ Message body ]
Next message: Keith H Duggar: "Re: why hierarchy?"
Previous message: Aloha Kakuikanu: "Re: why hierarchy?"
Maybe in reply to: David Cressey: "Re: Surrogate Keys: an Implementation Issue"
In reply to David Cressey: "Re: Relation Schemata vs. Relation Variables"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message