Re: Surrogate Key vs Production Key

From: Daniel Morgan <damorgan_at_x.washington.edu>
Date: Sat, 09 Oct 2004 12:55:11 -0700
Message-ID: <4168421F.3050204@x.washington.edu>

--CELKO-- wrote:

>>>I have a transaction table having unique external document number.

>
> It can be assured that it is unique. It is a varchar2(30) ... I should
> use this unique external document table as PK? ... OR Should I create
> an ID based on Sequence as 'Surrogate' PK? <<
>
> Ask yourself some basic data modeling questions.
>
> 1) Can I validate the key in itself? For example, an ISBN has a check
> digit, I parse a VIN to see that it is in the format with a simple
> regular expression, etc.
>
> 2) Can I verify the in the reality I am modeling? For example, I can
> read the barcode on the back of a book for the ISBN, I can read the
> VIN off of the engine block of my car, etc.
>
> This unique external document number sounds like it meets both
> criteria.
>
> Exposed physical locators like IDENTITY in T-SQL or Sequence in Oracle
> fail both tests. In fact they are completely non-relational and not
> even part of the reality of the model! Here is a "cut & paste" i use
> in the SQL Server groups when a newbie asks the same question.
> IDENTITY is their auto-numbering "feature":
>
> ============
>
> Let me go ahead one more step and play Q&A with the direction I think
> you are going:
>
> Q: Couldn't a compound key become very long?
>
> A1: So what? This is the 2000's century and we have much better
> computers than we did in the 1950's when key size was a real physical
> issue. What is funny to me is the number of idiots who replace a
> natural two or three integer compound key with a huge GUID that no
> human being or other system can possibly understand because they think
> it will be faster and easy to program.
>
> A2: This is an implementation problem that the SQL engine can handle.
> For example, Teradata is an SQL designed for VLDB apps that uses
> hashing instead of B-tree or other indexes. They guarantee that no
> search requires more than two probes, no matter how large the
> database. A tree index requires more and more probes as the size of
> the database increases.
>
> A3: A long key is not always a bad thing for performance. For
> example, if I use (city, state) as my key, I get a free index on just
> (city). I can also add extra columns to the key to make it a
> super-key when such a super-key gives me a covering index (i.e. an
> index which contains all of the columns required for a query, so that
> the base table does not have to be accessed at all).
>
>

>>>Do you then advocate never using an Identity attribute? Or is it

>
> acceptable (in the relational model) to have an Identity attribute to
> use as a handle to the row, and for attributes in other tables to use
> as the target for a foreign key? <<
>
> A handle to the row? Oh, you mean faking a sequential file's
> positional record number, so I can reference the physical storage
> location? Sure, if I want to lose all the advantages of an abstract
> data model, SQL set oriented programming, carry extra data and destroy
> the portability of code!
>
> More and more programmers who have absolutely no database training are
> being told to design a database. They are using GUIDs, IDENTITY,
> ROWID and other proprietary auto-numbering "features" in SQL products
> to imitate either a record number (sequential file system mindset) or
> OID (OO mindset) since they don't know anything else.
>
> Experienced database designers tend toward intelligent keys they find
> in industry standard codes, such as UPC, VIN, GTIN, ISBN, etc. They
> know that they need to verify the data against the reality they are
> modeling. A trusted external source is a good thing to have. I know
> why this VIN is associated with this car, but why is an identity value
> of 42 associated with this car? Try to verify the relationship in the
> reality you are modeling.
>
> The IDENTITY column is a holdover from the early programming languages
> which were very close to the hardware. For example, the fields (not
> columns; big difference) in a COBOL or FORTRAN program were assumed to
> be physically located in main storage in the order they were declared
> in the program, and as they were punched into cards or recorded on mag
> tapes. The languages have constructs using that model -- logical and
> physical implementations are practically one! The data has meaning
> BECAUSE of the program reading it (i.e. the same bits could be a
> character in one program and be an integer in another)
>
> The early SQLs were based on existing file systems. The data was kept
> in physically contiguous disk pages, in physically contiguous rows,
> made up of physically contiguous columns. In short, just like a deck
> of punch cards or a magnetic tape. Most programmer still carry that
> mental model, which is why I keep doing that rant about file vs.
> table, row vs. record and column vs. field.
>
> But physically contiguous storage is only one way of building a
> relational database and it is not the best one. The basic idea of a
> relational database is that user is not supposed to know *how* or
> *where* things are stored at all, much less write code that depends on
> the particular physical representation in a particular release of a
> particular product on particular hardware at a particular time.
>
> One of the biggest errors is the IDENTITY column (actually property,
> not a column at all; columns have datatypes, datatypes have to be
> NULL-able) in the Sybase/SQL Server family. People actually program
> with this "feature" and even use it as the primary key for the table!
> Now, let's go into painful details as to why this thing is bad.
>
> The first practical consideration is that IDENTITY is proprietary and
> non-portable, so you know that you will have maintenance problems when
> you change releases or port your system to other products. Newbies
> actually think they will never port code! Perhaps they only work for
> companies that are failing and will be gone in short time. Perhaps
> their code is such crap nobody else want their application. But
> professionals write code that has some ability to endure.
>
> But let's look at the logical problems. First try to create a table
> with two columns and try to make them both IDENTITY. If you cannot
> declare more than one column to be of a certain data type, then that
> thing is not a datatype at all, by definition. It is a property which
> belongs to the PHYSICAL table, not the LOGICAL data in the table.
>
> Next, create a table with one column and make it an IDENTITY. Now try
> to insert, update and delete different numbers from it. If you cannot
> insert, update and delete rows from a table, then it is not a table by
> definition.
>
> Finally create a simple table with one IDENTITY and a few other
> columns. Use a few statements like
>
> INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
> INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
> INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
>
> To put a few rows into the table and notice that the IDENTITY
> sequentially numbered them in the order they were presented. If you
> delete a row, the gap in the sequence is not filled in and the
> sequence continues from the highest number that has ever been used in
> that column in that particular table. This is how we did record
> numbers in pre-allocated sequential files in the 1950's, by the way. A
> utility program would then "pack" or "compress" the records that were
> flagged as deleted to move the empty space to the physical end of the
> physical file.
>
> But now use a statement with a query expression in it, like this:
>
> INSERT INTO Foobar (a, b, c)
> SELECT x, y, z
> FROM Floob;
>
> Since a query result is a table, and a table is a set which has no
> ordering, what should the IDENTITY numbers be? The entire, whole,
> completed set is presented to Foobar all at once, not a row at a time.
> There are (n!) ways to number (n) rows, so which one do you pick?
> The answer has been to use whatever the *physical* order of the result
> set happened to be.
>
> But it is actually worse than that. If the same query is executed
> again, but with new statistics or after an index has been dropped or
> added, the new execution plan could bring the result set back in a
> different physical order. Non-deterministic behavior is not good for
> data quality.
>
> Can you explain from a logical model why the same rows in the second
> query get different IDENTITY numbers? In the relational model, they
> should be treated the same if all the values of all the attributes are
> identical.
>
> Using IDENTITY as a primary key is a sign that there is no data model,
> only an imitation of a sequential file system. Since this "magic,
> all-purpose, one-size-fits-all" exposed physical locator exists only
> as a result of the physical state of a particular piece of hardware at
> a particular time as read by the current release of a particular
> database product, how do you verify that an entity has such a number
> in the reality you are modeling?
>
> You will see newbies who design tables like this:
>
> CREATE Drivers
> (driver_id IDENTITY (1,1) NOT NULL PRIMARY KEY,
> ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
> vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));
>
> Now input data and submit the same row a thousand times, a million
> times. Your data integrity is trashed. The natural key was this:
>
> CREATE Drivers
> (ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
> vin CHAR(17) NOT NULL REFERENCES Motorpool(vin),
> PRIMARY KEY (ssn, vin));
>
> I call such newbies "id-iots" because they always name the IDENTITY
> property column "id" in EVERY table. They don't understand basic data
> modeling -- one and only name for an attribute. About half the time
> they don't use any DRI; fields don't have DRI like columns and they
> are designing a file system not an RDBMS.
>
> Having a real key, which I must enforce for the data model to work,
> and the exposed physical locator means I am a man with two masters.
> If I update the table via the ssn and via the id, everything can get
> out of synch.
>
> Finally, an appeal to authority, with a quote from Dr. Codd:
> "..Database users may cause the system to generate or delete a
> surrogate, but they have no control over its value, nor is its value
> ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and
> Codd, E. (1979), Extending the database relational model to capture
> more meaning. ACM Transactions on Database Systems, 4(4). pp.
> 397-434.
>
> This means that a surrogate ought to act like an index; created by the
> user, managed by the system and NEVER seen by a user. That means
> never used in queries, DRI or anything else that a user does.
>
> Codd also wrote the following:
>
> "There are three difficulties in employing user-controlled keys as
> permanent surrogates for entities.
>
> (1) The actual values of user-controlled keys are determined by users
> and must therefore be subject to change by them (e.g. if two companies
> merge, the two employee databases might be combined with the result
> that some or all of the serial numbers might be changed.).
>
> (2) Two relations may have user-controlled keys defined on distinct
> domains (e.g. one uses social security, while the other uses employee
> serial numbers) and yet the entities denoted are the same.
>
> (3) It may be necessary to carry information about an entity either
> before it has been assigned a user-controlled key value or after it
> has ceased to have one (e.g. and applicant for a job and a retiree).
>
> These difficulties have the important consequence that an equi-join on
> common key values may not yield the same result as a join on common
> entities. A solution - proposed in part [4] and more fully in [14] -
> is to introduce entity domains which contain system-assigned
> surrogates. Database users may cause the system to generate or delete
> a surrogate, but they have no control over its value, nor is its value
> ever displayed to them....." (Codd in ACM TODS, pp 409-410).
>
> References
>
> Codd, E. (1979), Extending the database relational model to capture
> more meaning. ACM Transactions on Database Systems, 4(4). pp.
> 397-434
>
> ============
>
> Working with Oracle and not having ON UPDATE CASCADE is a HUGE pain
> for you. You will have to make up for their failure to provide you
> with even a minimal SQL implementation in their sub-standard product.
> if you can switch to a real SQL, I would do so immediately.

Thanks Joe. But I'll trade not having UPDATE CASCADE for having TSQL on only the Windows operating system any time any day any where on any project. But I've worked on Teradata ... perhaps some day you'll switch to a real O/S.

And with that I sip a 15 year old Laphroaig and salute you!

-- 
Daniel A. Morgan
University of Washington
damorgan_at_x.washington.edu
(replace 'x' with 'u' to respond)

Received on Sat Oct 09 2004 - 14:55:11 CDT