Re: NULLs: theoretical problems?
Date: Fri, 17 Aug 2007 00:03:48 +0200
Message-ID: <3vf9c351l07qa5jf0mui4lmccp8oetkonr_at_4ax.com>
On Sun, 12 Aug 2007 19:21:30 -0700, JOG wrote:
>On Aug 8, 11:43 am, "sinister" <sinis..._at_nospam.invalid> wrote:
>> Many discussions point out one deficiency of NULLs: that they collapse
>> multiple, distinct concepts into one ("no value possible," "value missing,"
>> "value not available at this time", etc).
>>
>> What are the other theoretical problems? My impression from skimming some
>> threads in this ng is that some anomalies might occur, maybe having to do
>> with NULLs and joins, or NULLs and keys composed of more than one field, but
>> I'm not sure.
>
>A database stores true propositions. A statement with a hole in it (an
>SQL null) is not a proposition, and hence is a theoretical abhorrence.
Hi JOG,
In everyday speech, propositions like these are often combined into sentences that contain multiple propositions, like this:
(Representation #3)
"The person identified by SSN 12345 goes by the name John Brown and was
born on 1960-08-31."
"The person identified by SSN 90163 goes by the name Kate Wilson."
"The person identified by SSN 55202 goes by the name Geoff T. Hurt and
was born on 1973-01-12."
In the tabular representation (note that I'm talking about paper and ink tables here, not about tables in a DB), a similar combination can be made, to save space, provide a better overview, and to match the way that real people exchange these proposition. Ths combined table would look as follows:
(Representation #4)
SSN | Name | DOB
-------+---------------+------------
12345 | John Brown | 1960-08-31 90163 | Kate Wilson | - 55202 | Geoff T. Hurt | 1973-01-12
Note the dash in the DOB column for Kate. This does not represent a
birthday of "-"; it is just a symbol to signify the reader that the
writer didn't accidentallly forget to fill out this cell, but rather
that he or she left it empty on purpose. Since this is a very common
practice, it is often not even mentioned in the legend (though it can be
confusing, e.g. in a table that also uses ++, +, -, and -- symbols to
signify a test rating).
I hope that we can agree that the semantics of the propositions don't
change when switching between tabular and verbalised representation - in
other words, representation #1 is equivalent to #2, and #3 is equivalent
to #4.
I also hope that we can agree that combining five short verbalised
single-proposition-sentences into three longer multi-proposition-
sentences doesn't change the meaning of the propositions themselves, and
that therefore representations #2 and #3 are also equivalent.
Since #1 == #2, #2 == #3 and #3 == #4, I can only conclude that #1 == #4
and that there is thus no functional difference between using a single
table (with either empty space, a dash symbol, or some other symbol to
represent a missing proposition) or using multiple tables where missing
propositions can be represented by omitting a complete line.
Back to databases. A DB can't store a dash symbol for a missing
proposition, since (a) a dash symbol is not part of each domain, and (b)
the dash symbol might have a special meaning in a column (as in the
rating column mentioned above). Instead, a DB uses NULL as a symbol with
the same meaning aas the dash in ink-and-paper tables: "this space left
empty intentionally".
The DB table storing the above information would look like this:
SSN | Name | DOB
and this is a faithful representation of these five true propositions:
"The person identified by SSN 12345 was born on 1960-08-31."
"The person identified by SSN 55202 was born on 1973-01-12."
The NULL in the DOB column does not represent a sixth proposition; it
just marks the absence of a proposition involving the SSN 90163 and the
verb "was born on".
I really fail to see how you can call this concept a "theoretical
abhorrence".
>Or from a different angle you might want to consider that a relation
Since NULL is part of any domain in an SQL database, a tuple with NULLs
*is* a subset of the cartesian product of the domains being considered.
>The solution is of course to decompose a relation around its key so
-------+---------------+------------
12345 | John Brown | 1960-08-31
90163 | Kate Wilson | NULL
55202 | Geoff T. Hurt | 1973-01-12
"The person identified by SSN 12345 goes by the name John Brown."
"The person identified by SSN 90163 goes by the name Kate Wilson."
"The person identified by SSN 55202 goes by the name Geoff T. Hurt."
>is a set of tuples. A tuple must contain a value in every position, or
>it is not a valid tuple (not being a subset of the cartesian product
>of the domains being considered). Hence an SQL-null is a theoretical
>abhorrence.
>that no nullable columns are required. This results in a schema with
>more relations, and tends to produce queries with more joins, but
>without any theoretical or logical flaws.
Except of course that many such joins need to be outer joins in order to prevent data from being lost, bringiing the NULLs you just removed right back in. You can eliminate NULLs from the stored data, but not from the product.
Allowing mandatory propositions to be combined but forcing optional
propositions into seperate tables is halfway between these two options,
with an (in my eyes) rather arbitratily chosen rule for when proposions
may or may not be combined.
And the model is extremely unstable, since it's not uncommon in real
businesses to see an optional proposition become mandatory (e.g. ebcause
of legislation change) or a mandatory proposition become optional (e.g.
because of competition). If all propositions are combined into a single
table, this is a simple change from NULL to NOT NULL or vice versa. If
each proposition has its own table, this is an equally simple change,
the addition or removal of a constraint that SQL does not currently
support but that is known as "equality constraint" in Object Role
Modeling. However, if only optional propositions are moved out of the
combined table, these changes result in adding or removing a table,
removing or adding a column to another table, and moving lots of data to
prevent data loss. This argument might be a bit too pragmatic for a
theory group, but I like my data models to be a bit more stable than
that!
>Some view such decomposition as being computationally inefficient,
>others reply that this is a physical implementation issue and not a
>concern of the logical model.
>
>Given the fact that it can generate much longer queries, I sometimes
>find myself allowing nulls in personal home-brew projects out of sheer
>laziness. However I do suffer from internal pangs of guilt during the
>process ;)
I hope that my explanations (if you've even managed to read this far - I did try to keep it short, honest!) have managed to easy your feeling of pain and guilt, as they are not actualy required.
Best, Hugo Received on Fri Aug 17 2007 - 00:03:48 CEST