
Re: Surrogate Key vs Production Key

From: Noons <wizofoz2k_at_yahoo.com.au>
Date: 9 Oct 2004 17:52:25 -0700
Message-ID: <73e20c6c.0410091652.26df4485@posting.google.com>


"Jonathan Lewis" <jonathan_at_jlcomp.demon.co.uk> wrote in message news:<ck8k9q$127$1_at_titan.btinternet.com>...  

> This doesn't make your original suggestion any
> more likely to be realistic. Manual data entry of child
> rows normally requires at least an initial cross-check
> against the existence of a parent. Your "thousands"
> of rows would probably require a typist to make the
> same typing error thousands (or at least hundreds) of
> times in a row without noticing. And if the data entry
> was a consequence of acquiring the real value from a
> real object this seems an unlikely chain of events.

I am not proposing a theory. What I related are facts which I have witnessed first hand. A wrongly typed postcode can end up used in thousands of related rows: after the initial entry it gets pulled in from a dropdown list, it is not typed again manually. The same goes for any other top-level hierarchical information: it is entered into the system once and then reused from dropdowns. Six months later someone realises that postcode 3133 should have been 3134. Bingo: maintenance nightmare with natural keys, zero hassle with surrogate keys.
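
To make it concrete, here is a minimal sketch of the two approaches (table and column names are invented for illustration, not taken from any real system):

-- Natural key: the postcode itself is the PK and is copied into every child.
CREATE TABLE postcode_nat (
  postcode  VARCHAR2(4)  PRIMARY KEY,
  suburb    VARCHAR2(40) NOT NULL
);

CREATE TABLE customer_nat (
  customer_id NUMBER      PRIMARY KEY,
  postcode    VARCHAR2(4) NOT NULL REFERENCES postcode_nat
);
-- Fixing 3133 -> 3134 now means touching the parent AND every child row.

-- Surrogate key: the postcode is just data, keyed by a meaningless id.
CREATE TABLE postcode_sur (
  postcode_id NUMBER       PRIMARY KEY,
  postcode    VARCHAR2(4)  NOT NULL UNIQUE,
  suburb      VARCHAR2(40) NOT NULL
);

CREATE TABLE customer_sur (
  customer_id NUMBER PRIMARY KEY,
  postcode_id NUMBER NOT NULL REFERENCES postcode_sur
);

-- The typo is fixed in exactly one place; no child row changes.
UPDATE postcode_sur SET postcode = '3134' WHERE postcode = '3133';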

Worse: three months later someone realises that Hawk120 is the incorrect name for an aircraft; it should have been Hawk128. So they mark Hawk120 as invalid and create a new entry. Next thing you hear, the monthly flight training reports are up the proverbial. So are the pilot flight stats. And it takes a concerted effort to find out what the problem was in the first place: after all, the user has done nothing wrong in terms of the operation of the system, just created a new type of aircraft and obsoleted another one. All perfectly acceptable business activities.
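
The reason the reports go wrong is easy to see from the sort of query behind them (again a made-up sketch, not anyone's actual reporting code):

-- Monthly hours grouped by the natural key: once Hawk120 is obsoleted and
-- Hawk128 created as a "new" type, all the historical flights stay attached
-- to Hawk120 and the totals silently split across the two codes.
SELECT   f.aircraft_code, SUM(f.flight_hours) AS hours_flown
FROM     flights f
GROUP BY f.aircraft_code;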

This is repeated every day, everywhere, all over IT. Natural keys are a great concept that just does not work in real life. Not with their current implementation, anyway.

> Ice cream melt ? Not in the UK ;)

Ooops. *Bad* example...

> > > Deferrable Constraints should deal with this.
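
For the record, this is roughly what that looks like in Oracle, if the FK on the made-up customer_nat table from above had been created as deferrable:

-- FK declared so the check can be postponed to commit time:
ALTER TABLE customer_nat ADD CONSTRAINT customer_nat_postcode_fk
  FOREIGN KEY (postcode) REFERENCES postcode_nat (postcode)
  DEFERRABLE INITIALLY IMMEDIATE;

-- Inside one transaction the FK check waits until COMMIT, so the parent
-- and its children can be corrected in either order.
SET CONSTRAINTS customer_nat_postcode_fk DEFERRED;
UPDATE postcode_nat SET postcode = '3134' WHERE postcode = '3133';
UPDATE customer_nat SET postcode = '3134' WHERE postcode = '3133';
COMMIT;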

> There are no such things as complex relationships -
> there are only collections of very simple PK/FK links.

I can send you some truly 'orrible ones! Don't get me started on subtyping in Oracle...

> It may be a coincidence, but all the systems that I
> have seen that want to partition on a "rolling date"
> want to do so because they want to drop old partitions
> and load new data efficiently. If they were able to just
> drop the data "because the date is in the past" then, and
> this is perhaps the coincidence, the date ALWAYS
> turned out to be a part of the primary key.

Different experience here. I'll give you a concrete example: mobile phone calls. Their PK is the phone call number in the data collection system. Their partitioning is by date and accounting status, so that after two months in the system they can quietly be "dw-housed" and dropped off the live accounting system. Let the DB2 dw folks worry about them...
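
Something along these lines (a bare-bones sketch with invented names; the real thing also splits by accounting status, which I have left out here):

CREATE TABLE phone_calls (
  call_id       NUMBER       NOT NULL,  -- the call number from the collection system
  call_start    DATE         NOT NULL,
  acct_status   VARCHAR2(10) NOT NULL,
  caller_msisdn VARCHAR2(20) NOT NULL,
  duration_secs NUMBER       NOT NULL,
  CONSTRAINT phone_calls_pk PRIMARY KEY (call_id)
)
PARTITION BY RANGE (call_start) (
  PARTITION calls_2004_08 VALUES LESS THAN (TO_DATE('2004-09-01','YYYY-MM-DD')),
  PARTITION calls_2004_09 VALUES LESS THAN (TO_DATE('2004-10-01','YYYY-MM-DD')),
  PARTITION calls_2004_10 VALUES LESS THAN (TO_DATE('2004-11-01','YYYY-MM-DD'))
);

-- After two months the oldest slice is exported to the warehouse and dropped:
ALTER TABLE phone_calls DROP PARTITION calls_2004_08 UPDATE GLOBAL INDEXES;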

I can think of further examples in the insurance industry, with claims. And probably a few more: flight data comes to mind. In general, any bulk capture data that needs to stay on the system for only a temporary period.

> Clarification: I was thinking only of Oracle systems
> and the Oracle implementation of partition independence
> which requires a guarantee that (and it isn't worded this
> way in the manuals) the partition that a row belongs in
> should be uniquely defined by the value of any primary
> or unique key that has been defined on the table.

*or unique key*. That is the whole thing, isn't it? A surrogate key can be the PK, and the natural key the UK (Yikes!). Then all is well, isn't it?
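
In index terms, using the made-up call table from above: the surrogate PK (call_id) does not contain the partition key, so the index enforcing it has to be a global one and needs attention when partitions are dropped; a natural key that includes the date column can be enforced by a local unique index, which simply disappears along with its partition. A sketch (the natural key itself is invented):

-- Local unique index: allowed because call_start, the partitioning column,
-- is part of the index key.
CREATE UNIQUE INDEX phone_calls_nat_uk
  ON phone_calls (caller_msisdn, call_start) LOCAL;

ALTER TABLE phone_calls ADD CONSTRAINT phone_calls_nat_uk
  UNIQUE (caller_msisdn, call_start)
  USING INDEX phone_calls_nat_uk;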

> If you think the hardware can make rubbish fly, then
> the hardware should be able to deal with a large volume
> of cascading updates across primary/foreign keys. But
> you've discounted that as not viable. If you're happy
> to write code to handle rubbish that needs to be cleaned
> up, then it's a bit unreasonable to say that you'd rather
> not have to do it. (Feel free to replace "rubbish" with
> "fuzzy data" in the above).

Well, it's the reality. Purity of data or its "rubbishing" down is the exclusive domain of the end user: they decide what stays or goes. It's up to us to figure out a design that can cope with the business requirements. Not up to the business to cope with inflexible system design. As such, we must accept the "rubbish", let them use it and let them fix it later if that is the case. What we can't do is tell them how to run their business (weeeeeelllll: SAP can...).

As for the cascade updates: it is a much, much higher penalty than what I have to pay with surrogates. That is how the software stands nowadays; I'm not saying it won't improve. I can't possibly walk into a high-volume environment and tell them I'm taking away 50% of the capacity of the system every time someone corrects a typo. You are not just deferring, you are also changing a bucketload of indexes everywhere that PK is referenced as an FK. No need to explain the impact of that in ANY system. However, I can easily tell them they need another 30 bytes per row and one extra index. And they can easily correct their data without a cascading impact everywhere. Weigh up the pros and cons? I have no doubt whatsoever where the advantage is.
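
For comparison, with the surrogate design the whole correction is this (same invented postcode tables as before), while the natural-key version repeats the deferred-constraint dance for every child table and every index carrying the postcode as an FK:

UPDATE postcode_sur SET postcode = '3134' WHERE postcode = '3133';
COMMIT;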
