Does "group by" automatically guarantee "order by"?
Asked Answered
A

6

24

Does "group by" clause automatically guarantee that the results will be ordered by that key? In other words, is it enough to write:

select * 
from table
group by a, b, c

or does one have to write

select * 
from table
group by a, b, c
order by a, b, c

I know e.g. in MySQL I don't have to, but I would like to know if I can rely on it accross the SQL implementations. Is it guaranteed?

Admonish answered 26/1, 2015 at 12:9 Comment(2)
Btw, I just checked: My PostgreSQL 9.3.4 does not automatically return sorted data when using group by.Frustule
I tried to AdventureWorks db of Msdn. It works, group by sorted data, I wonder it depends on data type? So I tried a date and only one group by statements.Bilk
G
32

group by does not order the data neccessarily. A DB is designed to grab the data as fast as possible and only sort if necessary.

So add the order by if you need a guaranteed order.

Girosol answered 26/1, 2015 at 12:10 Comment(2)
An efficient implementation of group by would perform the group-ing by sorting the data internally. That's why some RDBMS return sorted output when group-ing. Yet, the SQL specs don't mandate that behavior, so unless explicitly documented by the RDBMS vendor I wouldn't bet on it to work (tomorrow). OTOH, if the RDBMS implicitly does a sort it might also be smart enough to then optimize (away) the redundant order by.Frustule
thanks @JimmyB. Your comments on this page would probably consist a great answer if put together :-)Admonish
K
5

An efficient implementation of group by would perform the group-ing by sorting the data internally. That's why some RDBMS return sorted output when group-ing. Yet, the SQL specs don't mandate that behavior, so unless explicitly documented by the RDBMS vendor I wouldn't bet on it to work (tomorrow). OTOH, if the RDBMS implicitly does a sort it might also be smart enough to then optimize (away) the redundant order by. @jimmyb

An example using PostgreSQL proving that concept

Creating a table with 1M records, with random dates in a day range from today - 90 and indexing by date

CREATE TABLE WITHDRAW AS
  SELECT (random()*1000000)::integer AS IDT_WITHDRAW,
    md5(random()::text) AS NAM_PERSON,
    (NOW() - ( random() * (NOW() + '90 days' - NOW()) ))::timestamp AS DAT_CREATION, -- de hoje a 90 dias atras
    (random() * 1000)::decimal(12, 2) AS NUM_VALUE
  FROM generate_series(1,1000000);

CREATE INDEX WITHDRAW_DAT_CREATION ON WITHDRAW(DAT_CREATION);

Grouping by date truncated by day of month, restricting select by dates in a two days range

EXPLAIN 
SELECT
    DATE_TRUNC('DAY', W.dat_creation), COUNT(1), SUM(W.NUM_VALUE)
FROM WITHDRAW W
WHERE W.dat_creation >= (NOW() - INTERVAL '2 DAY')::timestamp
AND W.dat_creation < (NOW() - INTERVAL '1 DAY')::timestamp
GROUP BY 1

HashAggregate  (cost=11428.33..11594.13 rows=11053 width=48)
  Group Key: date_trunc('DAY'::text, dat_creation)
  ->  Bitmap Heap Scan on withdraw w  (cost=237.73..11345.44 rows=11053 width=14)
        Recheck Cond: ((dat_creation >= ((now() - '2 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))
        ->  Bitmap Index Scan on withdraw_dat_creation  (cost=0.00..234.97 rows=11053 width=0)
              Index Cond: ((dat_creation >= ((now() - '2 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))

Using a larger restriction date range, it chooses to apply a SORT

EXPLAIN 
SELECT
    DATE_TRUNC('DAY', W.dat_creation), COUNT(1), SUM(W.NUM_VALUE)
FROM WITHDRAW W
WHERE W.dat_creation >= (NOW() - INTERVAL '60 DAY')::timestamp
AND W.dat_creation < (NOW() - INTERVAL '1 DAY')::timestamp
GROUP BY 1

GroupAggregate  (cost=116522.65..132918.32 rows=655827 width=48)
  Group Key: (date_trunc('DAY'::text, dat_creation))
  ->  Sort  (cost=116522.65..118162.22 rows=655827 width=14)
        Sort Key: (date_trunc('DAY'::text, dat_creation))
        ->  Seq Scan on withdraw w  (cost=0.00..41949.57 rows=655827 width=14)
              Filter: ((dat_creation >= ((now() - '60 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))

Just by adding ORDER BY 1 at the end (there is no significant difference)

GroupAggregate  (cost=116522.44..132918.06 rows=655825 width=48)
  Group Key: (date_trunc('DAY'::text, dat_creation))
  ->  Sort  (cost=116522.44..118162.00 rows=655825 width=14)
        Sort Key: (date_trunc('DAY'::text, dat_creation))
        ->  Seq Scan on withdraw w  (cost=0.00..41949.56 rows=655825 width=14)
              Filter: ((dat_creation >= ((now() - '60 days'::interval))::timestamp without time zone) AND (dat_creation < ((now() - '1 day'::interval))::timestamp without time zone))

PostgreSQL 10.3

Kerbing answered 30/3, 2018 at 3:30 Comment(0)
A
3

It depends on the database vendor.

For example PostgreSQL does not automatically sort the grouped result. Here you have to use order by to get the data sorted.

But Sybase and Microsoft SQL Server do. Here you can use order by to change the default sorting.

Alathia answered 9/4, 2020 at 7:34 Comment(0)
M
1

It definitely doesn't. I have experienced that, once one of my queries suddenly started to return not-ordered results, as the data in the table grows by.

Menton answered 21/3, 2016 at 9:17 Comment(2)
This is the really treacherous behavior of GROUP BY - you develop your system around the observation that sorted data is returned and on the one millionth group by operation, suddenly, none-sorted data is returned. Caused me headaches for a week until I crafted out a minimal reproducible example to spot my mistake!Detergency
you should at least mention which database engine you use...Admonish
P
-1

I tried it. Adventureworks db of Msdn.

select HireDate, min(JobTitle)
from AdventureWorks2016CTP3.HumanResources.Employee
group by HireDate

Resuts :

2009-01-10Production Technician - WC40

2009-01-11Application Specialist

2009-01-12Assistant to the Chief Financial Officer

2009-01-13Production Technician - WC50<

It returns sorted data of hiredate, but you don't rely on GROUP BY to SORT under any circumstances.

for example; indexes can change this sorted data.

I added following index (hiredate, jobtitle)

CREATE NONCLUSTERED INDEX NonClusturedIndex_Jobtitle_hireddate ON [HumanResources].[Employee]
(
    [JobTitle] ASC,
    [HireDate] ASC
)

Result will change with same select query;

2006-06-30 Production Technician - WC60

2007-01-26 Marketing Assistant

2007-11-11 Engineering Manager

2007-12-05 Senior Tool Designer

2007-12-11 Tool Designer

2007-12-20 Marketing Manager

2007-12-26 Production Supervisor - WC60

You can download Adventureworks2016 at the following address

https://www.microsoft.com/en-us/download/details.aspx?id=49502

Puree answered 20/1, 2018 at 19:24 Comment(0)
L
-2

It depends on the number of records. When the records are less, Group by sorted automatically. When the records are more(more than 15) it required adding Order by clause

Lubin answered 25/11, 2016 at 10:58 Comment(1)
can you cite any references for that? This is probably only for some particular engine right?Admonish

© 2022 - 2024 — McMap. All rights reserved.