MySQL - Selecting a Column not in Group By
Asked Answered
I

4

66

I'm trying to add features to a preexisting application and I came across a MySQL view something like this:

SELECT
     AVG(table_name.col1),
     AVG(table_name.col2),
     AVG(table_name.col3),
     table_name.personID,
     table_name.col4
FROM table_name
GROUP BY table_name.personID;

OK so there's a few aggregate functions. You can select personID because you're grouping by it. But it also is selecting a column that is not in an aggregate function and is not a part of the GROUP BY clause. How is this possible??? Does it just pick a random value because the values definitely aren't unique per group?

Where I come from (MSSQL Server), that's an error. Can someone explain this behavior to me and why it's allowed in MySQL?

Inositol answered 21/6, 2009 at 7:6 Comment(0)
T
67

It's true that this feature permits some ambiguous queries, and silently returns a result set with an arbitrary value picked from that column. In practice, it tends to be the value from the row within the group that is physically stored first.

These queries aren't ambiguous if you only choose columns that are functionally dependent on the column(s) in the GROUP BY criteria. In other words, if there can be only one distinct value of the "ambiguous" column per value that defines the group, there's no problem. This query would be illegal in Microsoft SQL Server (and ANSI SQL), even though it cannot logically result in ambiguity:

SELECT AVG(table1.col1), table1.personID, persons.col4
FROM table1 JOIN persons ON (table1.personID = persons.id)
GROUP BY table1.personID;

Also, MySQL has an SQL mode to make it behave per the standard: ONLY_FULL_GROUP_BY

FWIW, SQLite also permits these ambiguous GROUP BY clauses, but it chooses the value from the last row in the group.


At least in the version I tested. What it means to be arbitrary is that either MySQL or SQLite could change their implementation in the future, and have some different behavior. You should therefore not rely on the behavior staying they way it is currently in ambiguous cases like this. It's better to rewrite your queries to be deterministic and not ambiguous. That's why MySQL 5.7 now enables ONLY_FULL_GROUP_BY by default.

Toneme answered 21/6, 2009 at 8:22 Comment(3)
I'd like to remark that this is not entirely true. As of ANSI SQL-99 selected fields must be aggregates, are functionally dependent on the group by clause. So selecting user_name when grouping by user_id is totally fine. SQL Server and Oracle don't comply with this, because they wouldn't allow user_name to be selected when only user_id is in the group by list; and MySQL doesn't comply, because it doesn't check whether each column selected is really functionally dependent on user_id.Calceolaria
@ThorstenKettner, thanks, you are correct. MySQL 5.7 has been improved, and it's much more intelligent in this case of supporting ANSI SQL.Toneme
Here is also an article which describes that issue: percona.com/blog/2006/09/06/…Epigraphic
I
17

I should have Googled for just a bit longer... It seems I found my answer.

MySQL extends the use of GROUP BY so that you can use nonaggregated columns or calculations in the SELECT list that do not appear in the GROUP BY clause. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. For example, you do not need to group on customer.name in the following query

In standard SQL, you would have to add customer.name to the GROUP BY clause. In MySQL, the name is redundant.

Still, that just seems... wrong.

Inositol answered 21/6, 2009 at 7:9 Comment(3)
You're right that it seems wrong. It is! While I'm sure there are some exception cases, as was pointed out by Bill Karwin above, too often I have seen developers, who don't know the data well enough or how this feature really works, write queries with improper group by clauses and obtain bad results. This feature should be off by default and allowed to be purposely overridden with a query option for use in cases where the engineer is informed enough to use it.Baal
It's no more "wrong" than having SELECT * FROM table1 return results in a given, consistent order: it's a feature, not a bug.Neilla
@Neilla It clearly is "more wrong" (worse:)). SQL is defined to be a set-based language and the order of the elements in a set is irrelevant. The SELECT * FROM table1 returns the same set of records as long as the records of the table don't change. By contrast, the a GROUPed query will return different set of records, depending on the order how the records were inserted into the table. This is definitely wrong. Similarly to many other things related to mysql, this is a dangerous gotcha mysql tries to sell as a "feature".Usurp
U
3

Let's say you have a query like this:

SELECT g, v 
FROM t
GROUP BY g;

In this case, for each possible value for g, MySQL picks one of the corresponding values of v.

However, which one is chosen, depends on some circumstances.

I read somewhere that for each group of g, the first value of v is kept, in the order how the records were inserted into the table t.

This is quite ugly, because the records in a table should be treated as a set where the order of the elements should not matter. This is so "mysql-ish"...

If you want to determine which value for v to keep, you need to apply a subselect for t like this:

SELECT g, v 
FROM (
    SELECT * 
        FROM t 
        ORDER BY g, v DESC
) q
GROUP BY g;

This way you define which order the records of the subquery are processed by the external query, thus you can trust which value of v it will pick for the individual values of g.

However, if you need some WHERE conditions then be very careful. If you add the WHERE condition to the subquery then it will keep the behaviour, it will always return the value you expect:

SELECT g, v 
FROM (
    SELECT * 
        FROM t 
        WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9' 
        ORDER BY g, v DESC
) q
GROUP BY g;

This is what you expect, the subselect filters and orders the table. It keeps the records where g has the given value and the external query returns that g and the first value for v.

However, if you add the same WHERE condition to the outer query then you get a non-deterministic result:

SELECT g, v 
FROM (
    SELECT * 
        FROM t 
        -- WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9' 
        ORDER BY g, v DESC
) q
WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9'
GROUP BY g;

Surprisingly, you may get different values for v when executing the same query again and again which is... strange. The expected behaviour is to get all the records in the appropriate order from the subquery, filtering them in the outer query and then picking the same as it picked in the previous example. But it does not.

It picks a value for v seemingly randomly. The same query returned different values for v if I executed more (~20) times, but the distribution was not uniform.

If instead of adding an outer WHERE, you specify a HAVING condition like this:

SELECT g, v 
FROM (
    SELECT * 
        FROM t1 
        -- WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9' 
        ORDER BY g, v DESC
) q
-- WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9'
GROUP BY g
HAVING g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9';

Then you get a consistent behaviour again.


CONCLUSION

I would suggest not to rely on this technique at all. If you really want/need to then avoid WHERE conditions in the outer query. Use it in the inner query if you can or a HAVING clause in the outer query.

I tested it with this data:

CREATE TABLE t1 (
    v INT,
    g VARCHAR(36)
);

INSERT INTO t1 VALUES (1, '737a8783-110c-447e-b4c2-1cbb7c6b72c9');
INSERT INTO t1 VALUES (2, '737a8783-110c-447e-b4c2-1cbb7c6b72c9');

in MySQL 5.6.41.

Maybe it is just a bug that gets/got fixed in newer versions, please give feedback if you have experience with newer versions.

Usurp answered 5/12, 2018 at 5:33 Comment(0)
S
0
select * from personel where p_id IN(select
min(dbo.personel.p_id)
FROM
personel
GROUP BY dbo.personel.p_adi)
Stlouis answered 15/4, 2013 at 9:21 Comment(2)
This definitely does not answer the questionGermanism
@Germanism It doesn't, but it kind-of explains what's happening. The code above is an example of how this non-standard behaviour can be modeled using standard SQL.Madai

© 2022 - 2024 — McMap. All rights reserved.