GROUP BY and aggregate sequential numeric values
Asked Answered
A

3

15

Using PostgreSQL 9.0.

Let's say I have a table containing the fields: company, profession and year. I want to return a result which contains unique companies and professions, but aggregates (into an array is fine) years based on numeric sequence:

Example Table:

+-----------------------------+
| company | profession | year |
+---------+------------+------+
| Google  | Programmer | 2000 |
| Google  | Sales      | 2000 |
| Google  | Sales      | 2001 |
| Google  | Sales      | 2002 |
| Google  | Sales      | 2004 |
| Mozilla | Sales      | 2002 |
+-----------------------------+

I'm interested in a query which would output rows similar to the following:

+-----------------------------------------+
| company | profession | year             |
+---------+------------+------------------+
| Google  | Programmer | [2000]           |
| Google  | Sales      | [2000,2001,2002] |
| Google  | Sales      | [2004]           |
| Mozilla | Sales      | [2002]           |
+-----------------------------------------+

The essential feature is that only consecutive years shall be grouped together.

Avaunt answered 4/11, 2011 at 19:13 Comment(0)
K
22

There's much value to @a_horse_with_no_name's answer, both as a correct solution and, like I already said in a comment, as a good material for learning how to use different kinds of window functions in PostgreSQL.

And yet I cannot help feeling that the approach taken in that answer is a bit too much of an effort for a problem like this one. Basically, what you need is an additional criterion for grouping before you go on aggregating years in arrays. You've already got company and profession, now you only need something to distinguish years that belong to different sequences.

That is just what the above mentioned answer provides and that is precisely what I think can be done in a simpler way. Here's how:

WITH MarkedForGrouping AS (
  SELECT
    company,
    profession,
    year,
    year - ROW_NUMBER() OVER (
      PARTITION BY company, profession
      ORDER BY year
    ) AS seqID
  FROM atable
)
SELECT
  company,
  profession,
  array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
  company,
  profession,
  seqID
Kiona answered 5/11, 2011 at 0:37 Comment(1)
It should be noted that this is the solution I in the long run, opted for using. It was for a single job which ran only periodically, so no need for a stored procedure. The code is clean, precise, and works as intended. Thanks so much!Avaunt
A
26

Identifying non-consecutive values is always a bit tricky and involves several nested sub-queries (at least I cannot come up with a better solution).

The first step is to identify non-consecutive values for the year:

Step 1) Identify non-consecutive values

select company, 
       profession,
       year,
       case 
          when row_number() over (partition by company, profession order by year) = 1 or 
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification

This returns the following result:

 company | profession | year | group_cnt
---------+------------+------+-----------
 Google  | Programmer | 2000 |         1
 Google  | Sales      | 2000 |         1
 Google  | Sales      | 2001 |         0
 Google  | Sales      | 2002 |         0
 Google  | Sales      | 2004 |         1
 Mozilla | Sales      | 2002 |         1

Now with the group_cnt value we can create "group IDs" for each group that has consecutive years:

Step 2) Define group IDs

select company,
   profession,
   year,
   sum(group_cnt) over (order by company, profession, year) as group_nr
from ( 
select company, 
       profession,
       year,
       case 
          when row_number() over (partition by company, profession order by year) = 1 or 
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification
) t1

This returns the following result:

 company | profession | year | group_nr
---------+------------+------+----------
 Google  | Programmer | 2000 |        1
 Google  | Sales      | 2000 |        2
 Google  | Sales      | 2001 |        2
 Google  | Sales      | 2002 |        2
 Google  | Sales      | 2004 |        3
 Mozilla | Sales      | 2002 |        4
(6 rows)

As you can see each "group" got its own group_nr and this we can finally use to aggregate over by adding yet another derived table:

Step 3) Final query

select company,
       profession,
       array_agg(year) as years
from (
  select company,
       profession,
       year,
       sum(group_cnt) over (order by company, profession, year) as group_nr
  from ( 
    select company, 
           profession,
           year,
           case 
              when row_number() over (partition by company, profession order by year) = 1 or 
                   year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
              else 0
           end as group_cnt
    from qualification
  ) t1
) t2
group by company, profession, group_nr
order by company, profession, group_nr

This returns the following result:

 company | profession |      years
---------+------------+------------------
 Google  | Programmer | {2000}
 Google  | Sales      | {2000,2001,2002}
 Google  | Sales      | {2004}
 Mozilla | Sales      | {2002}
(4 rows)

Which is exactly what you wanted, if I'm not mistaken.

Affected answered 4/11, 2011 at 20:4 Comment(2)
This absolutely 100% what I needed, and I very much appreciate all of the explanation and examples you've put into this!Avaunt
Very descriptive, and I also like the fact how this solution uses window functions of three different kinds (sum(), row_number(), lag()), making it a good material on how to use them (combined in one solution too).Kiona
K
22

There's much value to @a_horse_with_no_name's answer, both as a correct solution and, like I already said in a comment, as a good material for learning how to use different kinds of window functions in PostgreSQL.

And yet I cannot help feeling that the approach taken in that answer is a bit too much of an effort for a problem like this one. Basically, what you need is an additional criterion for grouping before you go on aggregating years in arrays. You've already got company and profession, now you only need something to distinguish years that belong to different sequences.

That is just what the above mentioned answer provides and that is precisely what I think can be done in a simpler way. Here's how:

WITH MarkedForGrouping AS (
  SELECT
    company,
    profession,
    year,
    year - ROW_NUMBER() OVER (
      PARTITION BY company, profession
      ORDER BY year
    ) AS seqID
  FROM atable
)
SELECT
  company,
  profession,
  array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
  company,
  profession,
  seqID
Kiona answered 5/11, 2011 at 0:37 Comment(1)
It should be noted that this is the solution I in the long run, opted for using. It was for a single job which ran only periodically, so no need for a stored procedure. The code is clean, precise, and works as intended. Thanks so much!Avaunt
D
5

Procedural solution with PL/pgSQL

The problem is rather unwieldy for plain SQL with aggregate / windows functions. While looping is typically slower than set-based solutions with plain SQL, a procedural solution with PL/pgSQL can make do with a single sequential scan over the table (implicit cursor of a FOR loop) and should be substantially faster in this particular case:

Test table:

CREATE TEMP TABLE tbl (company text, profession text, year int);
INSERT INTO tbl VALUES
  ('Google',  'Programmer', 2000)
, ('Google',  'Sales',      2000)
, ('Google',  'Sales',      2001)
, ('Google',  'Sales',      2002)
, ('Google',  'Sales',      2004)
, ('Mozilla', 'Sales',      2002)
;

Function:

CREATE OR REPLACE FUNCTION f_periods()
  RETURNS TABLE (company text, profession text, years int[])
  LANGUAGE plpgsql AS
$func$
DECLARE
   r  tbl; -- use table type as row variable
   r0 tbl;
BEGIN
   FOR r IN
      SELECT * FROM tbl t ORDER BY t.company, t.profession, t.year
   LOOP
      IF ( r.company,  r.profession,  r.year)
      <> (r0.company, r0.profession, r0.year + 1) THEN -- not true for first row

         RETURN QUERY
         SELECT r0.company, r0.profession, years; -- output row

         years := ARRAY[r.year];     -- start new array
      ELSE
         years := years || r.year;   -- add to array - year can be NULL, too
      END IF;

      r0 := r;                       -- remember last row
   END LOOP;

   RETURN QUERY                      -- output last iteration
   SELECT r0.company, r0.profession, years;
END
$func$;

Call:

SELECT * FROM f_periods();

db<>fiddle here

Produces the requested result.

Depressed answered 4/11, 2011 at 19:24 Comment(3)
Slightly edited examples to give you better idea. Notice Google,Sales returns 2 results, the first containing the years 2000, 2001, 2002 because they are in sequential order while 2004 is a separate entry because there is no 2003 to create a complete sequence.Avaunt
@ErwinBrandstetter: Nice solution and I agree that your version is probably a lot faster as it requires only a single scan over the table.Affected
@a_horse_with_no_name: though I have to admit that your version is an impressive demonstration of what window functions can do.Depressed

© 2022 - 2024 — McMap. All rights reserved.