Re: Examples of SQL anomalies?

From: Brian Selzer <brian_at_selzer-software.com>
Date: Sat, 5 Jul 2008 20:40:23 -0400
Message-ID: <YvUbk.13163$xZ.5629_at_nlpi070.nbdc.sbc.com>


"Cimode" <cimode_at_hotmail.com> wrote in message news:a2546fe6-6a1d-4560-a782-8cfef53e779f_at_25g2000hsx.googlegroups.com...

> On 5 juil, 01:52, "Brian Selzer" <br..._at_selzer-software.com> wrote:

>> "Cimode" <cim..._at_hotmail.com> wrote in message
>>
>> news:47943627-3b66-4fb2-950f-46359441f020_at_e53g2000hsa.googlegroups.com...
>>
>>
>>
>> > On 5 juil, 00:00, "Brian Selzer" <br..._at_selzer-software.com> wrote:
>> >> "David Cressey" <cresse..._at_verizon.net> wrote in message
>>
>> >>news:jSwbk.249$rb1.205_at_trndny08...
>>
>> >> > "Brian Selzer" <br..._at_selzer-software.com> wrote in message
>> >> >news:zOvbk.31378$ZE5.24693_at_nlpi061.nbdc.sbc.com...
>>
>> >> >> "David Cressey" <cresse..._at_verizon.net> wrote in message
>> >> >>news:D8ubk.240$rb1.211_at_trndny08...
>>
>> >> >> > "Brian Selzer" <br..._at_selzer-software.com> wrote in message
>> >> >> >news:gI6bk.11000$cW3.8591_at_nlpi064.nbdc.sbc.com...
>>
>> >> >> >> "David Cressey" <cresse..._at_verizon.net> wrote in message
>> >> >> >>news:wq4bk.28$0V1.10_at_trndny01...
>>
>> >> >> >> > "Brian Selzer" <br..._at_selzer-software.com> wrote in message
>> >> >> >> >news:nFD9k.5753$LG4.2422_at_nlpi065.nbdc.sbc.com...
>>
>> >> >> >> >> "-CELKO-" <jcelko..._at_earthlink.net> wrote in message
>>
>> >> >news:f219a6bd-9d8e-4cfe-9d60-ce9dcaeff16d_at_z66g2000hsc.googlegroups.com...
>> >> >> >> >> >>> The question is, if these issues are due to the SQL
>> >> > specification
>> >> >> > or
>> >> >> >> >> >>> simply due to a problem in a specific SQL product. Or
>> >> >> >> >> >>> could
>> >> >> >> >> >>> it
>> >> > be,
>> >> >> >> > that
>> >> >> >> >> >>> the definition is not precise enough in some points, so
>> >> >> >> >> >>> that
>> >> >> > database
>> >> >> >> >> >>> vendors implemented it differently? <<
>>
>> >> >> >> >> > Nope, it is the specs. All aggregate (set) functions begin
>> >> >> >> >> > by
>> >> >> >> >> > removing the NULLs from their parameter set, then if there
>> >> >> >> >> > is
>> >> >> >> >> > a
>> >> >> >> >> > DISTINCT option on the parameter, they remove redundant
>> >> >> >> >> > duplicates
>> >> >> > and
>> >> >> >> >> > finally do the operation (MIN, MAX, AVG, SUM, COUNT on what
>> >> >> >> >> > is
>> >> > left.
>> >> >> >> >> > Since an empty set has no elements upon which to apply an
>> >> > operation,
>> >> >> >> >> > SQL returns a NULL (okay, it should be an "undefined" if we
>> >> >> >> >> > were
>> >> >> >> >> > mathematically correct).
>>
>> >> >> >> >> MIN, MAX and AVG are meaningless when applied to an empty
>> >> >> >> >> bag,
>> >> >> >> >> but
>> >> > it
>> >> >> >> > seems
>> >> >> >> >> to me that COUNT should always return 0 when the bag is
>> >> >> >> >> empty,
>> >> >> >> >> and
>> >> >> >> >> similarly, SUM should return 0. SUM should only return NULL
>> >> >> >> >> if
>> >> >> >> >> one
>> >> > of
>> >> >> >> >> the
>> >> >> >> >> values to be summed is NULL.
>>
>> >> >> >> > By definition, none of the values to be summed are NULL.
>>
>> >> >> >> OK. OK. Yes, I know. NULL is not a value. I guess to be
>> >> >> >> precise,
>> >> >> >> I
>> >> >> > should
>> >> >> >> have said "if the cardinality of the set of rows targeted by the
>> >> >> >> query
>> >> > is
>> >> >> >> greater than the number of values to be summed, then SUM should
>> >> >> >> return
>> >> >> >> NULL," but I think that would have caused more confusion.
>> >> >> >> Bottom
>> >> >> >> line:
>> >> >> >> if
>> >> >> >> not all of the amounts are known, then the total amount is
>> >> >> >> suspect.
>>
>> >> >> > This has to do with a continuing disagreement between you and me.
>> >> >> > In
>> >> >> > my
>> >> >> > view there is no such thing as the "set of rows targeted by the
>> >> >> > query".
>> >> >> > There is a set of data targeted by the query. That is all. Rows
>> >> >> > that
>> >> > do
>> >> >> > not contain any data with regard to the current query are not
>> >> >> > part
>> >> >> > of
>> >> > the
>> >> >> > target, by definition.
>>
>> >> >> > Thus, sum (SALARY) from FUBAR is the sum of a bag of salaries.
>> >> >> > It's
>> >> > not
>> >> >> > the sum of a bag of rows from FUBAR. Rows in FUBAR that don't
>> >> >> > contain
>> >> >> > a
>> >> >> > SALARY are not part of the sum. That is all.
>>
>> >> >> What about sum (SALARY) from FUBAR where Fu = 'Bar'? Doesn't
>> >> >> "where
>> >> >> Fu =
>> >> >> 'Bar'" restrict the query to the result of evaluating "where Fu =
>> >> >> 'Bar'"?
>> >> >> Clearly if COUNT(*) where Fu = 'Bar' is greater than COUNT(*) where
>> >> >> Fu
>> >> >> =
>> >> >> 'Bar' and SALARY is not NULL, then it is obvious that some salaries
>> >> >> aren't
>> >> >> known, hence the total of those salaries, sum (SALARY) where Fu =
>> >> >> 'Bar'
>> >> >> is
>> >> >> suspect.
>>
>> >> >> I can't understand that you don't get this since it is just so
>> >> >> simple.
>>
>> >> > It is simple, I do get it, and you are wrong.
>>
>> >> How can that be? If I have ten packages that need to be shipped, but
>> >> I
>> >> only
>> >> know what eight of them weigh, then how can I print a bill of lading
>> >> that
>> >> requires the total weight for the shipment? Obviously, the sum of
>> >> just
>> >> the
>> >> eight weights will be less than what is actually shipped, so it should
>> >> be
>> >> obvious that any attempt to pass that sum off as the total weight must
>> >> be
>> >> held suspect.
>>
>> > How about "the system you plan on designing with NULLS does not handle
>> > that case OR it handles only te case for the 8 weigthed items"
>> > limitation. Best thing is extrapolation.
>>
>> Best thing is to treat the sum as suspect, as I have been suggesting from
>> the start. Extrapolation is problematic because there is no reason to
>> assume that the remaining items are comparable to those whose weight has
>> already been supplied. Suppose for example that the eight items for
>> which
>> the weight has already been supplied are each under 2 pounds, but that
>> the
>> remaining two items each weigh well over a ton. Extrapolation would
>> yield
>> an estimate orders of magnitude in error.
> I should have written..*next* best thing is extrapolation.  Neither
> extrapolation or flag a result as suspect are correct solutions to the
> problem.  In fact both of them are hacks the former being a more
> sophisticated then the latter.
>
> *suspect* has exactly the same downsides than *extrapolation* except
> that it may lead to further confusion due to subjective interpretation
> by users for the exact same reasons you mentionned.  Why would an
> output be more suspect than another.  Definitely not a good idea.
> Extrapolation through mathematical functions at least has the
> advantage of allowing the system to use probabilism

It sounds to me like you're arguing against missing information. Missing information is a fact of life--regardless of whether it represented as nulls or as absent tuples. In fact, the latter requires an open-world interpretation, since under the closed-world interpretation, if a tuple isn't present, then the assertion it represents is false. On the other hand, a null indicates that there should be a value but it hasn't been supplied. It is in accord with the closed-world interpretation since the assertion represented by an incomplete tuple is true. The bottom line is that if information can be missing, such as the weight of a package, then you're going to be faced with indeterminate results, regardless of how the information or lack thereof is represented. I think *suspect* does not have the downsides of *extrapolation* because it is the user that is guessing, not the system. I am more inclined to believe that the user is better able to make educated guesses than the system. Ask yourself this: would the user be informed that the answer to his query is the result of an extrapolation? Unlikely. If, however, the result of a sum is NULL, and if the user knows that a NULL result means that information that would normally have gone into that result is missing, then the user can query further to find out what is missing and make a better educated guess, or even perhaps supply what is missing. Why would the user bother to inquire further if he already has an answer--even if that answer is orders of magnitude off the mark? Received on Sun Jul 06 2008 - 02:40:23 CEST

Original text of this message