Aggregation is Inverse of Join

From: Marshall <marshall.spight_at_gmail.com>
Date: 14 Jan 2007 13:08:21 -0800
Message-ID: <1168808901.723906.156680_at_s34g2000cwa.googlegroups.com>



A while back, we got another brilliant insight from Vadim that aggregation including group-by is the inverse operation of join.

As a possibly amusing side note on the human condition, here is how my understanding of his insight went:

Vadim: aggregation is inverse of join.
Marshall: what?
Vadim: you heard me.
Marshall: um, okay, I kind of see, but here are a bunch of objections.
Vadim: those don't hold up because blah blah blah. [five months pass]
Marshall: I get it!

Anyway, the basic idea is that sum() can be seen to be a many:one relation that, when we divide a relation by it, with a little renaming, will aggregate a column and group by the other columns.

A:{a, b, i}
sum:{i *-> total}

(Relation A has attribute a, b, and i.
Sum is an aggregate function taking bag-of-i to total.)

The expression:

  A / sum

is the same as:

  select a, b, sum(i) from A

Interestingly, if we *multiply* the divisor by an attribute, we alter the grouping.

  A / (b * sum)

is the same as:

  select a, sum(i) from A

We can consider multiplication by an attribute as being either a join with the projection of the universal relation U over the attribute,
or a join with the projection of the dividend over the attribute. Or possibly a join with the domain of the attribute. (Join in this case being a cross product.)

There are renaming issues involved in making this work, but they are not very interesting so I'm not going to discuss them here.

Anyway, I've been trying to organize my thoughts on this issue for a while, and I haven't had much success, so I'm just going to blurt out a bunch of things and we'll see where it goes.

The idea of a many-to-one function is interesting. It's sort of the reverse of a multifunction. Multifunctions are sometimes expressed as generators in programming languages; perhaps we can imagine a language where many-to-one functions are expressed as aggregates.

Vadim initially described the function as a bag -> singleton relation, however I note that only works for functions whose binary form is both commutative and associative. (Such as sum, which is aggregated addition, which is comm. and assoc.) If the binary function isn't both commutative and associative, then we must consider the order of the input, which is to say the function is list -> singleton.

This is also reminiscent of the situation with one-to-many functions: we may have a generator that generates duplicate or ordered outputs, in which case we have to preserve order, or it might generate unique outputs, in which case we can consider it independent of order. For example, a generator that produces an infinite stream of ones could not be considered a set, because it would only have one member. However a generator that produces the natural numbers could be considered in sequence or out. Further I note many places where computation is ordered; eg. recursive functions. Thus I conclude that the most general form of multifunctions is list-to-single, or single-to-list.

I also observe that I don't have a good name for many-to-one functions. One-to-many are generally called multifunctions; once I clued in to aggregation-as-division I've been calling many-to-one functions aggregates, but these will not generally be seen to be the same.

Well, I've run out of steam. More later.

Marshall

PS. I told you I wasn't organized. Received on Sun Jan 14 2007 - 22:08:21 CET

Original text of this message