Why does MySQL allow "group by" queries WITHOUT aggregate functions?

R

6

73

Surprise -- this is a perfectly valid query in MySQL:

select X, Y from someTable group by X

If you tried this query in Oracle or SQL Server, you’d get the natural error message:

Column 'Y' is invalid in the select list because it is not contained in 
either an aggregate function or the GROUP BY clause.

So how does MySQL determine which Y to show for each X? It just picks one. From what I can tell, it just picks the first Y it finds. The rationale being, if Y is neither an aggregate function nor in the group by clause, then specifying “select Y” in your query makes no sense to begin with. Therefore, I as the database engine will return whatever I want, and you’ll like it.

There’s even a MySQL configuration parameter to turn off this “looseness”. http://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmode_only_full_group_by

This article even mentions how MySQL has been criticized for being ANSI-SQL non-compliant in this regard. http://www.oreillynet.com/databases/blog/2007/05/debunking_group_by_myths.html

My question is: Why was MySQL designed this way? What was their rationale for breaking with ANSI-SQL?

Rochellerochemont answered 3/8, 2009 at 23:44 Comment(5)

Let me put it this way. I see this design choice as being equivalent to a programming language choosing to allow and ignore, say, letting "null" be a left-hand-side value. e.g. "null = 3". There's just no reason for letting that happen. It's the kind of error that is always and dangerously wrong. – Rochellerochemont 3/8, 2009 at 23:49

@lumpynose, nonsense, that may have been true pre 5.x – Orchid 31/8, 2011 at 11:43

@lumpynose Can you give a reference to your affirmation? – Vociferation 15/2, 2013 at 21:51

Oh, man, I've wanted something resembling this, so much in SQL Server. I've always imagined it as "I want the "PICKANY(col)" aggregation function, or even better the "UNIQUEVALUE(col)" function, which would error if it turns out NOT to be constant across the group. I tend to use "MAX(col)" as my "all the values are the same, give me that value" aggregation function – Junie 1/5, 2019 at 16:9

Sorry, the above comment was a rambly way of saying "IMO, the answer would be because this functionality is really useful. But I have no citation, other than 'because I want it'". – Junie 1/5, 2019 at 16:10

S

23

I believe that it was to handle the case where grouping by one field would imply other fields are also being grouped:

SELECT user.id, user.name, COUNT(post.*) AS posts 
FROM user 
  LEFT OUTER JOIN post ON post.owner_id=user.id 
GROUP BY user.id

In this case the user.name will always be unique per user.id, so there is convenience in not requiring the user.name in the GROUP BY clause (although, as you say, there is definite scope for problems)

Silden answered 4/8, 2009 at 0:6 Comment(7)

So it was just to save some typing (as in keyboard typing)? Heh. – Rochellerochemont 4/8, 2009 at 1:14

Less columns in GROUP BY clause means faster execution time so it's an optimizing hack. I'm constrantly using MAX(user.name) AS name in similar queries on ANSI SQL implementations. – Lamoreaux 4/8, 2009 at 12:50

@wqw: rubbish. user.name is either in the group by or in an aggregate. This is ambiguous at best See comments to #6060741 Only MySQL allows such bollocks, for example https://mcmap.net/q/41319/-sql-query-not-showing-expected-result/27535 – Hilburn 10/7, 2011 at 16:35

@gbn: Rubbish to what? Using something like SELECT user.id, ANY(user.name) .. GROUP BY user.id is not indeterministic if there is a PK on user.id. Obviously the buckets user.name's are picked up from will contain a single unique value -- the name of the user whose id is grouped on. Using MAX(user.name) AS name or any other aggragate is awkward at best (though intent is more clearly documented). Could be a case of "It depends" again.. – Lamoreaux 18/7, 2011 at 14:0

Postgresql facilitates such functionality and in a controlled manner: ienablemuch.com/2010/08/postgresql-recognizing-functional.html – Buccal 25/8, 2012 at 10:14

AFAIK the latest SQL standard demands all selected field to be either group fields or aggregated fields or fields being determinate for the group, such as user.name in above example. So MySQL is compliant to the latest standard insofar as they let you select fields that are neither group fields nor aggregated fields but determinate for the group - a thing that many other dbms don't allow. However, MySQL breaks standard compliance by not checking wether the field is really determinate, thus allowing indeterminate fields as well. [to be continued] – Adalia 28/5, 2014 at 10:38

It is probably hard for a dbms to reliably find out wether a selected field would be determinate or not. Postgres is too restrictive in this point, MySQL too docile. – Adalia 28/5, 2014 at 10:38

M

26

According to this page (the 5.0 online manual), it's for better performance and user convenience.

Mohur answered 10/8, 2011 at 21:26 Comment(2)

+1, for the MySQL answer :). I use this all the time to avoid doing a sub-query that does an ORDER BY ... LIMIT 1 ... you just have to be careful about knowing that the data you receive in non-aggregated columns will be random to all rows that match your conditions. – Zed 1/7, 2014 at 19:30

LInk is dead, can you update it? or better yet insert the docs here in case it breaks again – Vickievicksburg 28/3, 2017 at 16:5

S

23

I believe that it was to handle the case where grouping by one field would imply other fields are also being grouped:

SELECT user.id, user.name, COUNT(post.*) AS posts 
FROM user 
  LEFT OUTER JOIN post ON post.owner_id=user.id 
GROUP BY user.id

In this case the user.name will always be unique per user.id, so there is convenience in not requiring the user.name in the GROUP BY clause (although, as you say, there is definite scope for problems)

Silden answered 4/8, 2009 at 0:6 Comment(7)

So it was just to save some typing (as in keyboard typing)? Heh. – Rochellerochemont 4/8, 2009 at 1:14

Less columns in GROUP BY clause means faster execution time so it's an optimizing hack. I'm constrantly using MAX(user.name) AS name in similar queries on ANSI SQL implementations. – Lamoreaux 4/8, 2009 at 12:50

@wqw: rubbish. user.name is either in the group by or in an aggregate. This is ambiguous at best See comments to #6060741 Only MySQL allows such bollocks, for example https://mcmap.net/q/41319/-sql-query-not-showing-expected-result/27535 – Hilburn 10/7, 2011 at 16:35

@gbn: Rubbish to what? Using something like SELECT user.id, ANY(user.name) .. GROUP BY user.id is not indeterministic if there is a PK on user.id. Obviously the buckets user.name's are picked up from will contain a single unique value -- the name of the user whose id is grouped on. Using MAX(user.name) AS name or any other aggragate is awkward at best (though intent is more clearly documented). Could be a case of "It depends" again.. – Lamoreaux 18/7, 2011 at 14:0

Postgresql facilitates such functionality and in a controlled manner: ienablemuch.com/2010/08/postgresql-recognizing-functional.html – Buccal 25/8, 2012 at 10:14

AFAIK the latest SQL standard demands all selected field to be either group fields or aggregated fields or fields being determinate for the group, such as user.name in above example. So MySQL is compliant to the latest standard insofar as they let you select fields that are neither group fields nor aggregated fields but determinate for the group - a thing that many other dbms don't allow. However, MySQL breaks standard compliance by not checking wether the field is really determinate, thus allowing indeterminate fields as well. [to be continued] – Adalia 28/5, 2014 at 10:38

It is probably hard for a dbms to reliably find out wether a selected field would be determinate or not. Postgres is too restrictive in this point, MySQL too docile. – Adalia 28/5, 2014 at 10:38

C

3

Unfortunately almost all the SQL varieties have situations where they break ANSI and have unpredictable results.

It sounds to me like they intended it to be treated like the "FIRST(Y)" function that many other systems have.

More than likely, this construct is something that the MySQL team regret, but don't want to stop supporting because of the number of applications that would break.

Coney answered 4/8, 2009 at 0:26 Comment(1)

This definitely seems to be the case. – Indigestible 14/7, 2020 at 22:24

H

2

MySQL treats this is a single column DISTINCT when you use GROUP BY without an aggregate function. Using other options you either have the whole result be distinct, or have to use subqueries, etc. The question is whether the results are truly predictable.

Also, good info is in this thread.

Hetaera answered 4/1, 2012 at 13:27 Comment(0)

T

0

From what I have read in the mysql reference page, it says: "You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group."

I suggest you to read this page (link to the reference manual of mysql): http://dev.mysql.com/doc/refman/5.5/en//group-by-extensions.html

Trotman answered 15/2, 2013 at 21:11 Comment(0)

C

-1

Its actually a very useful tool that all other fields dont have to be in an aggregate function when you group by a field. You can manipulate the result which will be returned by simply ordering it first and then grouping it after. for instance if i wanted to get user login information and i wanted to see the last time the user logged in i would do this.

Tables

USER
user_id | name

USER_LOGIN_HISTORY 
user_id | date_logged_in

USER_LOGIN_HISTORY has multiple rows for one user so if i joined users to it it would return many rows. as i am only interested in the last entry i would do this

select 
  user_id,
  name,
  date_logged_in

from(

  select 
    u.user_id, 
    u.name, 
    ulh.date_logged_in

  from users as u

    join user_login_history as ulh
      on u.user_id = ulh.user_id

  where u.user_id = 1234

  order by ulh.date_logged_in desc 

)as table1

group by user_id

This would return one row with the name of the user and the last time that user logged in.

Coray answered 25/8, 2012 at 10:1 Comment(2)

My example above is only being used purely to show how you can manipulate the returned result. I'm not saying this is how you would return that information in the simplest way. You would use the MAX function. With much more complex queries it becomes very useful to be able to group without aggregate functions being used on all other fields – Coray 25/8, 2012 at 11:9

This contrived example is both longer and slower than simply doing a straight forward "max" and therefore doesn't support your claim that this is "a very useful tool" at all. If you can't even contrive an example of it being useful, I seriously question its usefulness. I also don't believe that intentionally using what is frequently indeterminate functionality is going to get more useful in more complex queries. – Cromorne 23/4, 2015 at 16:15

Recommended topics

Hot tags