Why does SQL force me to repeat all non-aggregated fields from my SELECT clause in my GROUP BY clause? [closed]
Asked Answered
C

10

38

This has bugged me for a long time.

99% of the time, the GROUP BY clause is an exact copy of the SELECT clause, minus the aggregate functions (MAX, SUM, etc.).
This breaks the Don't Repeat Yourself principle.

When can the GROUP BY clause not contain an exact copy of the SELECT clause minus the aggregate functions?

edit

I realise that some implementations allow you to have different fields in the GROUP BY than in the SELECT (hence 99%, not 100%), but surely that's a very minor exception?
Can someone explain what is supposed to be returned if you use different fields?

Thanks.

Crus answered 6/1, 2009 at 13:59 Comment(0)
E
18

I tend to agree with you - this is one of many cases where SQL should have slightly smarter defaults to save us all some typing. For example, imagine if this were legal:

Select ClientName, InvoiceAmount, Sum(PaymentAmount) Group By *

where "*" meant "all the non-aggregate fields". If everybody knew that's how it worked, then there would be no confusion. You could sub in a specific list of fields if you wanted to do something tricky, but the splat means "all of 'em" (which in this context means, all the possible ones).

Granted, "*" means something different here than in the SELECT clause, so maybe a different character would work better:

Select ClientName, InvoiceAmount, Sum(PaymentAmount) Group By !

There are a few other areas like that where SQL just isn't as eloquent as it could be. But at this point, it's probably too entrenched to make many big changes like that.

Elevation answered 6/1, 2009 at 14:38 Comment(4)
yes. thank you. that sort of thing sounds sensible. I guess the answer is "there isn't really an answer". ;)Crus
Difficult though to select a meaningful character or keyword. The * or "ALL" keyword isn't semantically correct, as you're not grouping by ALL (you're grouping by all EXCEPT . . .). It's not a bad idea, can't see how it would work without being a semantic kludge though.Casein
Why not just GROUP instead of GROUP BY <some wildcard>?Negro
Sure, that would work too, but I suspect that might be less than intuitive for current SQL programmers. For that matter, the syntax could also be to leave it off entirely, and the parser would know that you're supposed to automatically group by any non-aggregate fields. That's probably how it should have been done in the first place, IMO, but maybe that's getting into "trying to be so helpful that you end up being confusing" territory.Elevation
C
7

Because they are two different things, you can group by items that aren't in the select clause

EDIT:

Also, is it safe to make that assumption?

I have a SQL statement

Select ClientName, InvAmt, Sum(PayAmt) as PayTot

Is it "correct" for the server to assume I want to group by ClientName AND InvoiceAmount? I personally prefer (and think it's safer) to have this code

Select ClientName, InvAmt, Sum(PayAmt) as PayTot
Group By ClientName

throw an error, prompting me to change the code to

Select ClientName, Sum(InvAmt) as InvTot, Sum(PayAmt) as PayTot
Group By ClientName
Casein answered 6/1, 2009 at 14:5 Comment(2)
That's true, but why require the columns in the SELECT list if they're mandatory? SELECT a, MAX(c) FROM t GROUP BY b could imply grouping by a, couldn't it? I think it may just be a clarity issue.Closefitting
Implied operations are the bane of debugging and testing.Blunk
C
3

I hope/expect we'll see something more comprehensive soon; a SQL history lesson on the subject would be useful and informative. Anyone? Anyone? Bueller?

In the meantime, I can observe the following:

SQL predates the DRY principle, at least as far as it it was documented in The Pragmatic Programmer.

Not all DBs require the full list: Sybase, for example, will happily execute queries like

SELECT a, b, COUNT(*)
FROM some_table
GROUP BY a

... which (at least every time I accidentally ran such a monster) often leads to such enormous inadvertent recordsets that panic-stricken requests quickly ensue, begging the DBAs to bounce the server. The result is a sort of partial Cartesian product, but I think it may mostly be a failure on Sybase's part to implement the SQL standard properly.

Closefitting answered 6/1, 2009 at 14:9 Comment(0)
R
2

The good reason for it is that you would get incorrect results more often than not if you did not specify all columns. Suppose you have three columns, col1, col2 and col3.

Suppose your data looks like this:

Col1  Col2 Col3
a      b    1
a      c    1
b      b    2
a      b    3

select col1, col2, sum(col3) from mytable group by col1, col2
would give the following results:

Col1  Col2 Col3
a      b    4
a      c    1
b      b    2

How would it interpret
select col1, col2, sum(col3) from mytable group by col1

My guess would be

Col1  Col2 Col3
a      b    5
a      c    5
b      b    2

These are clearly bad results. Of course the more complex the query and the more joins the less likely it would be that the query would return correct results or that the programmer would even know if they were incorrect.

Personally I'm glad that group by requires the fields.

Rumney answered 6/1, 2009 at 19:43 Comment(1)
MySQL - as an example of a SQL syntax that supports this - makes no guarantees about what is returned in this case. Effectively, it could be any random value from the result set. Although I've never seen it happen, it could even return relation attributes from different rows of the same relation.Plagiarism
S
2

Perhaps we need a shorthand form - call it GroupSelect

GroupSelect Field1, Field2, sum(Field3) From SomeTable Where (X = "3")

This way, the parser need only throw an error if you leave out an aggregate function.

Shiner answered 6/1, 2009 at 20:25 Comment(0)
G
2

I agree with GROUP BY ALL, GROUP BY *, or something similar. As mentioned in the original post, in 99% (perhaps more) of the cases you want to group by all non-aggregate columns/expressions.

Here is however one example where you would need GROUP BY columns, for backward compatibility reasons.

SELECT 
  MIN(COUNT(*)) min_same_combination_cnt, 
  MAX(COUNT(*)) max_same_comb_cnt, 
  AVG(COUNT(*)) avg_same_comb_cnt, 
  SUM(COUNT(*)) total_records,
  COUNT(COUNT(*)) distinct_combinations_cnt
FROM <some table>
GROUP BY <list of columns>

This works in Oracle. I use it to estimate selectivity on columns. The group by is applied to the inner aggregate function. Then, the outer aggregate is applied.

It would be nice to put forward a suggestion for this improvement to the SQL standard. I just don't know how that works.

Glycogenesis answered 16/7, 2012 at 18:43 Comment(0)
W
1

Actually, wouldn't that be 100% of the time? Is there a case in which you can have a (non-aggregate) column in the select that is not in the GROUP BY?

I don't have an answer though. It certainly does seem like a awkward moment for the language.

Whiff answered 6/1, 2009 at 14:3 Comment(0)
G
1

I share the op's view that repeating is a bit annoying, especially if the non-aggregate fields contain elaborate statements like ifs and functions and a whole lot of other things. It would be nice if there could be some shorthand in the group by clause - at least a column alias. Referring to the columns by number may be another option, albeit one that probably has their own problems.

Gratian answered 6/1, 2009 at 20:18 Comment(1)
nice point about long expressions. it does add to the pain.Crus
C
0

There could be a situation that you needed to extract one id of all the rows grouped, and sum of their quantities - for example. In this case you would i.e. group them by name and leave ids not grouped. SQLite seems to work this way.

Chatelain answered 11/2, 2012 at 15:45 Comment(0)
S
0

Since group by result in single tuple for a whole group of tuples so other non group by attributes must be used in aggregate function only. If u add non group by attribute in select then sql cant decide which which value to be select from that group.

Springer answered 4/3, 2013 at 20:23 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.