
Re: Surrogate Key vs Production Key

From: Howard J. Rogers <hjr_at_dizwell.com>
Date: Sun, 10 Oct 2004 12:27:19 +1000
Message-Id: <41689eaf$0$23895$afc38c87@news.optusnet.com.au>


Noons wrote:

> "Jonathan Lewis" <jonathan_at_jlcomp.demon.co.uk> wrote in message
> news:<ck8k9q$127$1_at_titan.btinternet.com>...
>

>> This doesn't make your original suggestion any
>> more likely to be realistic. Manual data entry of child
>> rows normally requires at least an initial cross-check
>> against the existence of a parent.  Your "thousands"
>> of rows would probably require a typist to make the
>> same typing error thousands (or at least hundreds) of
>> times in a row without noticing.  And if the data entry
>> was a consequence of acquiring the real value from a
>> real object this seems an unlikely chain of events.

>
> I am not proposing a theory. What I related are facts.
> Which I have witnessed first hand. A wrongly typed postcode
> can be used in thousands of related rows, because it gets
> pulled in from a dropdown list rather than entered manually.
> Same goes for any other top-level hierarchical information: it
> is entered into the system once and gets used from dropdowns.
> Six months later someone realises that postcode 3133 should
> have been 3134. Bingo: maintenance nightmare with natural
> keys. Zero hassle with surrogate keys.

It's the last bit I don't get. Are you saying that it would be no problem with a surrogate key, because all you'd have to update is a non-key "postcode" field?

So what's the hassle with writing a trigger that effects an 'on update cascade' functionality?
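
Something along these lines would do it, for instance (table and
column names invented for illustration; it assumes the foreign key
has been declared deferrable, so the constraint isn't checked until
the transaction completes):

    alter table addresses add constraint addr_pc_fk
        foreign key (postcode) references postcodes (postcode)
        deferrable initially deferred;

    create or replace trigger postcode_cascade
    after update of postcode on postcodes
    for each row
    begin
        -- re-point every child row at the corrected postcode
        update addresses
           set postcode = :new.postcode
         where postcode = :old.postcode;
    end;
    /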

I mean, obviously we all agree that updating a primary key is not desirable. And if that primary key has been used as a foreign key in many child rows, that makes the undesirability worse. But it's the updating of the key that's undesirable, not its use in the first place as the primary key. If there's a trigger or a procedure that can be invoked to replicate the update to all affected child rows, that seems to mitigate the undesirability big-time.

I can't, in short, see that what you are describing is a real argument against natural keys, as opposed to a real argument against sloppy coding practices that allow sloppy user practices to taint the data in the first place.

> Worse: three months later someone realises that Hawk120 is the
> incorrect name for an aircraft, it should have been Hawk128.
> So they mark Hawk120 as invalid and create a new entry.
> Next, you hear the monthly flight training reports are up the
> proverbial. So are the pilot flight stats. And it takes a
> concerted effort to try and find out what the problem was in
> the first place: after all, the user has done nothing wrong in
> terms of the operation of the system: just created a new type
> of aircraft and obsoleted another one. All perfectly acceptable
> business activities.

Again, I used to have this all the time too: "Cannot delete code BS1, it is in use by 5163 plots", my application would reply when I tried a silly parent table DML. Version two of the application used to come back with "The code BS1 is in use by 5163 plots of land. Would you like me to change all of these plots to some other code, or delete them entirely?"
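
Roughly what version two did, sketched in PL/SQL (the codes and
plots tables, and the procedure itself, are invented for
illustration):

    create or replace procedure retire_code (
        p_old_code in varchar2,
        p_new_code in varchar2   -- null means delete the plots as well
    ) as
    begin
        if p_new_code is not null then
            -- re-point the children at the replacement code
            update plots set code = p_new_code where code = p_old_code;
        else
            -- or remove them along with their parent
            delete from plots where code = p_old_code;
        end if;
        delete from codes where code = p_old_code;
    end retire_code;
    /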

I see you go on to say that cascading updates or deletes would be a performance problem. All I can say is that, practically and really, I never had a problem with natural keys. The need for cascading updates and deletes was rare enough for it not to be a performance issue. And if it hadn't been a rare occurrence, I would still say that's an application design and coding issue, not a data modelling one. (Or perhaps a user training issue.)

The Royal Parks in London got their grass mown and their shrub beds mulched on the strength of natural keys. I think Her Maj might have been a synthetic key supporter by now if they hadn't been up to the job.

Regards
HJR
> This is repeated every day, everywhere, all over IT. Natural
> keys are a great concept that just does not work in real life.
> Not with their current implementation, anyway.
>
>

>>     Ice cream melt ?  Not in the UK ;)

>
> Ooops. *Bad* example...
>
>
>> > > Deferrable Constraints should deal with this.

>
>> There are no such things as complex relationships -
>> there are only collections of very simple PK/FK links.

>
> I can send you some truly 'orrible ones!
> Don't get me started on subtyping in Oracle...
>
>>     It may be a coincidence, but all the systems that I
>>     have seen that want to partition on a "rolling date"
>>     want to do so because they want to drop old partitions
>>     and load new data efficiently.  If they were able to just
>>     drop the data "because the date is in the past" then, and
>>     this is perhaps the coincidence, the date ALWAYS
>>     turned out to be a part of the primary key.

>
>
> Different experience here. I'll give you a concrete example:
> mobile phone calls. Their PK is the phone call number in
> the data collection system. Their partitioning is by date
> and accounting status. So that after two months in the system
> they can quietly be "dw-housed" and dropped off the live
> accounting system. Let the DB2 dw folks worry about them...
>
> I can think of further examples in the insurance industry,
> with claims. And probably a few more: flight data comes to
> mind. In general, any bulk capture data that needs to stay
> on the system for only a temporary period.
>
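
For what it's worth, that arrangement would look roughly like this
(names invented, and the accounting-status dimension omitted for
brevity). Note that because the surrogate PK doesn't include the
partition key, its index has to be global:

    create table calls (
        call_id     number       not null,
        call_date   date         not null,
        acct_status varchar2(10) not null,
        constraint calls_pk primary key (call_id)
    )
    partition by range (call_date) (
        partition p2004_08 values less than (date '2004-09-01'),
        partition p2004_09 values less than (date '2004-10-01'),
        partition p2004_10 values less than (date '2004-11-01')
    );

    -- two months on, the oldest month is archived off and dropped;
    -- "update global indexes" (9i onwards) keeps the PK index usable
    alter table calls drop partition p2004_08 update global indexes;
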
>>     Clarification: I was thinking only of Oracle systems
>>     and the Oracle implementation of partition independence
>>     which requires a guarantee that (and it isn't worded this
>>     way in the manuals) the partition that a row belongs in
>>     should be uniquely defined by the value of any primary
>>     or unique key that has been defined on the table.

>
> *or unique key*. That is the whole thing, isn't it?
> A surrogate key can be the PK, and the natural key the
> UK (Yikes!). Then all is well, isn't it?
>
>>     If you think the hardware can make rubbish fly, then
>>     the hardware should be able to deal with a large volume
>>     of cascading updates across primary/foreign keys. But
>>     you've discounted that as not viable.  If you're happy
>>     to write code to handle rubbish that needs to be cleaned
>>     up, then it's a bit unreasonable to say that you'd rather
>>     not have to do it. (Feel free to replace "rubbish" with
>>     "fuzzy data" in the above).

>
> Well, it's the reality. Purity of data or its "rubbishing" down
> is the exclusive domain of the end user: they decide what stays
> or goes. It's up to us to figure a design that can cope with
> the business requirements. Not up to the business to cope with
> inflexible system design. As such, we must accept the "rubbish",
> let them use it and let them fix it later if that is the case.
> What we can't do is tell them how to run their business
> (weeeeeelllll: SAP can...).
>
> As for the cascade updates: it is a much, much higher penalty than
> what I have to pay with surrogates. As software stands nowadays,
> not saying it won't improve. I can't possibly walk into a high
> volume environment and tell them I'm taking away 50% of the
> capacity of the system every time someone corrects a typo.
> You are not just deferring, you are also changing a bucketload
> of indexes everywhere that PK is referenced as a FK. No need
> to explain the impact of that in ANY system.
> However, I can easily tell them they need another 30 bytes per row
> and one extra index. And they can easily correct their data
> without a cascading impact everywhere. Weigh up the pros and cons?
> I have no doubt whatsoever where the advantage is.
Received on Sat Oct 09 2004 - 21:27:19 CDT
