Dynamic alternative to pivot with CASE and GROUP BY
I have a table that looks like this:

id    feh    bar
1     10     A
2     20     A
3      3     B
4      4     B
5      5     C
6      6     D
7      7     D
8      8     D

And I want it to look like this:

bar  val1   val2   val3
A     10     20 
B      3      4 
C      5        
D      6      7     8

I have this query that does this:

SELECT bar, 
   MAX(CASE WHEN abc."row" = 1 THEN feh ELSE NULL END) AS "val1",
   MAX(CASE WHEN abc."row" = 2 THEN feh ELSE NULL END) AS "val2",
   MAX(CASE WHEN abc."row" = 3 THEN feh ELSE NULL END) AS "val3"
FROM
(
  SELECT bar, feh, row_number() OVER (partition by bar order by feh) as row
  FROM "Foo"
 ) abc
GROUP BY bar 

Is there a way to improve this query or write it more efficiently altogether?

Korea answered 19/3, 2013 at 17:16 Comment(3)
If you can load the standard-contrib tablefunc module, the crosstab function will do what you want.Lankton
@Korea You can use PL/PgSQL to query for the entry with the most values and use EXECUTE with the format function to generate the CASE statements dynamically. Daniel is right that it's probably better to just use crosstab though.Nino
@CraigRinger: I went into detail to explore "dynamic" possibilities with crosstab().Highlight

If you have not installed the additional module tablefunc, run this command once per database:

CREATE EXTENSION tablefunc;

Answer to question

A very basic crosstab solution for your case:

SELECT * FROM crosstab(
  'SELECT bar, 1 AS cat, feh
   FROM   tbl_org
   ORDER  BY bar, feh')
 AS ct (bar text, val1 int, val2 int, val3 int);  -- more columns?

The special difficulty here is that there is no category (cat) column in the base table. For the basic 1-parameter form, we can just provide a dummy column with a dummy value serving as category. The value is ignored anyway.

This is one of the rare cases where the second parameter of the crosstab() function is not needed: by the nature of this problem, all NULL values appear only in dangling columns to the right, and the order is determined by the value itself.

If we had an actual category column with names determining the order of values in the result, we'd need the 2-parameter form of crosstab(). Here I synthesize a category column with the help of the window function row_number(), to base crosstab() on:

SELECT * FROM crosstab(
   $$
   SELECT bar, val, feh
   FROM  (
      SELECT *, 'val' || row_number() OVER (PARTITION BY bar ORDER BY feh) AS val
      FROM tbl_org
      ) x
   ORDER BY 1, 2
   $$
 , $$VALUES ('val1'), ('val2'), ('val3')$$         -- more columns?
) AS ct (bar text, val1 int, val2 int, val3 int);  -- more columns?
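
Both queries return the result requested in the question:

 bar | val1 | val2 | val3
-----+------+------+------
 A   |   10 |   20 |
 B   |    3 |    4 |
 C   |    5 |      |
 D   |    6 |    7 |    8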

The rest is pretty much run-of-the-mill. Find more explanation and links in these closely related answers.

Basics: read this first if you are not familiar with the crosstab() function!

Proper test setup

The test setup that's missing in the question:

CREATE TEMP TABLE tbl_org (id int, feh int, bar text);
INSERT INTO tbl_org (id, feh, bar) VALUES
  (1, 10, 'A')
, (2, 20, 'A')
, (3,  3, 'B')
, (4,  4, 'B')
, (5,  5, 'C')
, (6,  6, 'D')
, (7,  7, 'D')
, (8,  8, 'D')
;

Dynamic crosstab?

Not very dynamic yet, as @Clodoaldo commented. Dynamic return types are hard to achieve in plpgsql. But there are ways around it, with some limitations.

So as not to complicate the rest any further, I demonstrate with a simpler test case:

CREATE TEMP TABLE tbl (row_name text, attrib text, val int);
INSERT INTO tbl (row_name, attrib, val) VALUES
  ('A', 'val1', 10)
, ('A', 'val2', 20)
, ('B', 'val1', 3)
, ('B', 'val2', 4)
, ('C', 'val1', 5)
, ('D', 'val3', 8)
, ('D', 'val1', 6)
, ('D', 'val2', 7)
;

Call:

SELECT * FROM crosstab('SELECT row_name, attrib, val FROM tbl ORDER BY 1,2')
AS ct (row_name text, val1 int, val2 int, val3 int);

Returns:

 row_name | val1 | val2 | val3
----------+------+------+------
 A        | 10   | 20   |
 B        |  3   |  4   |
 C        |  5   |      |
 D        |  6   |  7   |  8

Built-in feature of tablefunc module

The tablefunc module provides a simple infrastructure for generic crosstab() calls without the need to provide a column definition list. It ships a number of functions written in C (very fast):

crosstabN()

crosstab1() - crosstab4() are pre-defined. One minor point: they require and return all text. So we need to cast our integer values. But it simplifies the call:

SELECT * FROM crosstab4('SELECT row_name, attrib, val::text  -- cast!
                         FROM tbl ORDER BY 1,2');

Result:

 row_name | category_1 | category_2 | category_3 | category_4
----------+------------+------------+------------+------------
 A        | 10         | 20         |            |
 B        | 3          | 4          |            |
 C        | 5          |            |            |
 D        | 6          | 7          | 8          |

Custom crosstab() function

For more columns or other data types, we create our own composite type and function (once).
Type:

CREATE TYPE tablefunc_crosstab_int_5 AS (
  row_name text, val1 int, val2 int, val3 int, val4 int, val5 int);

Function:

CREATE OR REPLACE FUNCTION crosstab_int_5(text)
  RETURNS SETOF tablefunc_crosstab_int_5
AS '$libdir/tablefunc', 'crosstab' LANGUAGE c STABLE STRICT;

Call:

SELECT * FROM crosstab_int_5('SELECT row_name, attrib, val   -- no cast!
                              FROM tbl ORDER BY 1,2');

Result:

 row_name | val1 | val2 | val3 | val4 | val5
----------+------+------+------+------+------
 A        |   10 |   20 |      |      |
 B        |    3 |    4 |      |      |
 C        |    5 |      |      |      |
 D        |    6 |    7 |    8 |      |

One polymorphic, dynamic function for all

This goes beyond what's covered by the tablefunc module.
To make the return type dynamic I use a polymorphic type with a technique detailed in this related answer:

1-parameter form:

CREATE OR REPLACE FUNCTION crosstab_n(_qry text, _rowtype anyelement)
  RETURNS SETOF anyelement
  LANGUAGE plpgsql AS
$func$
BEGIN
   RETURN QUERY EXECUTE 
   (SELECT format('SELECT * FROM crosstab(%L) t(%s)'
                , _qry
                , string_agg(quote_ident(attname) || ' ' || atttypid::regtype
                           , ', ' ORDER BY attnum))
    FROM   pg_attribute
    WHERE  attrelid = pg_typeof(_rowtype)::text::regclass
    AND    attnum > 0
    AND    NOT attisdropped);
END
$func$;

Overload with this variant for the 2-parameter form:

CREATE OR REPLACE FUNCTION crosstab_n(_qry text, _cat_qry text, _rowtype anyelement)
  RETURNS SETOF anyelement
  LANGUAGE plpgsql AS
$func$
BEGIN
   RETURN QUERY EXECUTE 
   (SELECT format('SELECT * FROM crosstab(%L, %L) t(%s)'
                , _qry, _cat_qry
                , string_agg(quote_ident(attname) || ' ' || atttypid::regtype
                           , ', ' ORDER BY attnum))
    FROM   pg_attribute
    WHERE  attrelid = pg_typeof(_rowtype)::text::regclass
    AND    attnum > 0
    AND    NOT attisdropped);
END
$func$;

pg_typeof(_rowtype)::text::regclass: There is a row type defined for every user-defined composite type, so that attributes (columns) are listed in the system catalog pg_attribute. The fast lane to get it: cast the registered type (regtype) to text and cast this text to regclass.
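
To see the mechanism in isolation, a quick hedged check (assuming the tablefunc_crosstab_int_5 type from above exists):

-- cast chain: regtype -> text -> regclass of the implicit row type
SELECT pg_typeof(NULL::tablefunc_crosstab_int_5)::text::regclass;

-- the attributes the function reads from the catalog:
SELECT attname, atttypid::regtype
FROM   pg_attribute
WHERE  attrelid = 'tablefunc_crosstab_int_5'::regclass
AND    attnum > 0
AND    NOT attisdropped
ORDER  BY attnum;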

Create composite types once

You need to define once every return type you are going to use:

CREATE TYPE tablefunc_crosstab_int_3 AS (
    row_name text, val1 int, val2 int, val3 int);

CREATE TYPE tablefunc_crosstab_int_4 AS (
    row_name text, val1 int, val2 int, val3 int, val4 int);

...

For ad-hoc calls, you can also just create a temporary table to the same (temporary) effect:

CREATE TEMP TABLE temp_xtype7 (
    row_name text, x1 int, x2 int, x3 int, x4 int, x5 int, x6 int, x7 int);

Or use the type of an existing table, view or materialized view if available.
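
A hedged usage sketch with the ad-hoc temp table type from just above (in the 1-parameter form, surplus columns to the right simply stay NULL):

SELECT * FROM crosstab_n(
   'SELECT row_name, attrib, val FROM tbl ORDER BY 1,2'
 , NULL::temp_xtype7);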

Call

Using above row types:

1-parameter form (no missing values):

SELECT * FROM crosstab_n(
   'SELECT row_name, attrib, val FROM tbl ORDER BY 1,2'
 , NULL::tablefunc_crosstab_int_3);

2-parameter form (some values can be missing):

SELECT * FROM crosstab_n(
   'SELECT row_name, attrib, val FROM tbl ORDER BY 1'
 , $$VALUES ('val1'), ('val2'), ('val3')$$
 , NULL::tablefunc_crosstab_int_3);

This one function works for all return types, while the crosstabN() framework provided by the tablefunc module needs a separate function for each.
If you have named your types in sequence as demonstrated above, you only have to replace the trailing number in the type name. To find the maximum number of categories in the base table:

SELECT max(count(*)) OVER () FROM tbl  -- returns 3
GROUP  BY row_name
LIMIT  1;

That's about as dynamic as this gets if you want individual columns. Arrays as demonstrated by @Clodoaldo, a simple text representation, or the result wrapped in a document type like json or hstore can work for any number of categories dynamically.

Disclaimer:
It's always potentially dangerous when user input is converted to code. Make sure this cannot be used for SQL injection. Don't accept input from untrusted users (directly).
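
For illustration, a minimal hedged sketch of what the %L specifier in format() does (the input string is a made-up example):

SELECT format('SELECT * FROM crosstab(%L) t(row_name text, val1 int)'
            , $$it's got 'quotes'$$);
-- -> SELECT * FROM crosstab('it''s got ''quotes''') t(row_name text, val1 int)

This only guards the quoting step, though: crosstab() still executes whatever query text it receives, hence the warning above.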

Call for original question:

SELECT * FROM crosstab_n('SELECT bar, 1, feh FROM tbl_org ORDER BY 1,2'
                       , NULL::tablefunc_crosstab_int_3);
Highlight answered 20/3, 2013 at 2:42 Comment(7)
Not all that dynamic since the computed column names must be provided. And the OP says they can be many.Opponent
@ClodoaldoNeto: I dug deeper. You may be interested in the sizable update to my answer.Highlight
Yes very good, but still one must know the number of columns and create the types. I made it completely dynamic here but I have a genetic allergy to complexity so I think the array solution I gave for this question is much better if column names taken from the data are not required.Opponent
@ClodoaldoNeto: "Better" is defined by requirements.Highlight
@ClodoaldoNeto: I only just now realized your link under here that's referring to another answer. Very nice. Seems pretty useful for long lists of categories or ad-hoc queries. But it needs two queries. One to create the table, one to read from it. With two queries, one can make everything "dynamic". 1. build query string, 2. execute it. The challenge is to do it in a single query and SQL wants to know the return type up front.Highlight
One query only would be nice. But it seems not possible in the current state of tablefunc, as you know from trying more than once. I guess the question's author does not care about the number of queries as long as the performance is adequate for a given scenario. I don't remember why I opted for the anonymous block in that answer, but I guess it could be a function and that would make it one query only.Opponent
@ClodoaldoNeto: No it wouldn't. Not for a dynamic result type. Try it. :)Highlight

Although this is an old question, I would like to add another solution made possible by recent improvements in PostgreSQL. This solution achieves the same goal of returning a structured result from a dynamic data set without using the crosstab function at all. In other words, this is a good example of re-examining unintentional and implicit assumptions that prevent us from discovering new solutions to old problems. ;)

To illustrate, you asked for a method to transpose data with the following structure:

id    feh    bar
1     10     A
2     20     A
3      3     B
4      4     B
5      5     C
6      6     D
7      7     D
8      8     D

into this format:

bar  val1   val2   val3
A     10     20 
B      3      4 
C      5        
D      6      7     8

The conventional solution is a clever (and incredibly knowledgeable) approach to creating dynamic crosstab queries that is explained in exquisite detail in Erwin Brandstetter's answer.

However, if your particular use case is flexible enough to accept a slightly different result format, then another solution is possible that handles dynamic pivots beautifully. This technique, which I learned of here, uses PostgreSQL's json_object_agg function to construct pivoted data on the fly in the form of a JSON object.

I will use Mr. Brandstetter's "simpler test case" to illustrate:

CREATE TEMP TABLE tbl (row_name text, attrib text, val int);
INSERT INTO tbl (row_name, attrib, val) VALUES
   ('A', 'val1', 10)
 , ('A', 'val2', 20)
 , ('B', 'val1', 3)
 , ('B', 'val2', 4)
 , ('C', 'val1', 5)
 , ('D', 'val3', 8)
 , ('D', 'val1', 6)
 , ('D', 'val2', 7);

Using the json_object_agg function, we can create the required pivoted result set with this pithy beauty:

SELECT
  row_name AS bar,
  json_object_agg(attrib, val) AS data
FROM tbl
GROUP BY row_name
ORDER BY row_name;

Which outputs:

 bar |                  data                  
-----+----------------------------------------
 A   | { "val1" : 10, "val2" : 20 }
 B   | { "val1" : 3, "val2" : 4 }
 C   | { "val1" : 5 }
 D   | { "val3" : 8, "val1" : 6, "val2" : 7 }

As you can see, this function works by creating key/value pairs in the JSON object from the attrib and value columns in the sample data, all grouped by row_name.

Although this result set obviously looks different, I believe it will actually satisfy many (if not most) real-world use cases, especially those where the data requires a dynamically generated pivot, or where the resulting data is consumed by a parent application (e.g., where it needs to be re-formatted for transmission in an HTTP response).

Benefits of this approach:

  • Cleaner syntax. I think everyone would agree that the syntax for this approach is far cleaner and easier to understand than even the most basic crosstab examples.

  • Completely dynamic. No information about the underlying data need be specified beforehand. Neither the column names nor their data types need be known ahead of time.

  • Handles large numbers of columns. Since the pivoted data is saved as a single jsonb column, you will not run up against PostgreSQL's column limit (≤1,600 columns, I believe). There is still a limit, but I believe it is the same as for text fields: 1 GB per JSON object created (please correct me if I am wrong). That's a lot of key/value pairs!

  • Simplified data handling. I believe that the creation of JSON data in the DB will simplify (and likely speed up) the data conversion process in parent applications. (You will note that the integer data in our sample test case was correctly stored as such in the resulting JSON objects; PostgreSQL handles this by automatically converting its intrinsic data types to JSON in accordance with the JSON specification. A quick check follows this list.) This will effectively eliminate the need to manually cast data passed to parent applications: it can all be delegated to the application's native JSON parser.
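
A quick check of that last point (with made-up throwaway values):

SELECT json_object_agg(k, v) FROM (VALUES ('n', 42), ('t', 7)) x(k, v);
-- -> { "n" : 42, "t" : 7 }   -- integers stay unquoted integers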

Differences (and possible drawbacks):

  • It looks different. There's no denying that the results of this approach look different. The JSON object is not as pretty as the crosstab result set; however, the differences are purely cosmetic. The same information is produced, in a format that is probably more friendly for consumption by parent applications.

  • Missing keys. Missing values in the crosstab approach are filled in with nulls, while the JSON objects simply lack the applicable keys. You will have to decide for yourself whether this is an acceptable trade-off for your use case. It seems to me that any attempt to address this problem in PostgreSQL will greatly complicate the process and likely involve some introspection in the form of additional queries.

  • Key order is not preserved. I don't know if this can be addressed in PostgreSQL, but this issue is mostly cosmetic too, since parent applications are either unlikely to rely on key order or able to determine the proper key order by other means. The worst case will probably only require an additional query of the database. (A sketch of the ORDER BY remedy, noted in the comments below, follows this list.)
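
A minimal sketch of the ORDER BY remedy (see also the comments below):

SELECT
  row_name AS bar,
  json_object_agg(attrib, val ORDER BY attrib) AS data
FROM tbl
GROUP BY row_name
ORDER BY row_name;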

Conclusion

I am very curious to hear the opinions of others (especially @ErwinBrandstetter's) on this approach, especially as it pertains to performance. When I discovered this approach on Andrew Bender's blog, it was like getting hit in the side of the head. What a beautiful way to take a fresh approach to a difficult problem in PostgreSQL. It solved my use case perfectly, and I believe it will likewise serve many others.

Liborio answered 17/10, 2016 at 18:57 Comment(7)
I built on your answer for completeness. https://mcmap.net/q/28876/-dynamic-alternative-to-pivot-with-case-and-group-byOpponent
Thanks for adding your expertise! I think this method is the best alternative to crosstab when a dynamic query is needed.Liborio
Any thoughts on how to order by a specific data key? Seems like it doesn't like the traditional data->>key operator due to data being an alias.Rochkind
See Clodoaldo's excellent and comprehensive answer below: https://mcmap.net/q/28876/-dynamic-alternative-to-pivot-with-case-and-group-by.Liborio
I just saw this just now. Like I wrote: the result wrapped in a document type like json or hstore can work for any number of categories dynamically. You implemented this with modern tools for jsonb. If a single aggregated result column is acceptable, the solution is great. (There are many similar, older answers around. Example.) But table-functions returning truly dynamic columns are still impossible (incl. pg 10). Probably not going to change any time soon, the limitation is anchored in the strictly typed SQL language itself.Highlight
Solid answer. Curious, how to turn the JSON keys into columns now, as asked by the OP?Command
Just a note on the "Key order": You can add ORDER BY to json_object_agg like: json_object_agg(attrib, val ORDER BY attrib DESC) AS dataOchrea

This is to complete @Damian's good answer. I had already suggested the JSON approach in other answers before 9.6's handy json_object_agg function; it just takes more work with the previous tool set.
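
To illustrate the "more work": a hedged sketch building the same objects by hand with the older tool set (string_agg plus to_json):

select
    row_name as bar,
    ('{' || string_agg(to_json(attrib)::text || ' : ' || to_json(val)::text
                     , ', ' order by attrib) || '}')::json as data
from tbl
group by row_name
order by row_name;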

Two of the cited possible drawbacks are not really drawbacks. The random key order is trivially corrected if necessary. The missing keys, if relevant, take an almost trivial amount of code to address:

select
    row_name as bar,
    json_object_agg(attrib, val order by attrib) as data
from
    tbl
    right join
    (
        (select distinct row_name from tbl) a
        cross join
        (select distinct attrib from tbl) b
    ) c using (row_name, attrib)
group by row_name
order by row_name
;
 bar |                     data                     
-----+----------------------------------------------
 A   | { "val1" : 10, "val2" : 20, "val3" : null }
 B   | { "val1" : 3, "val2" : 4, "val3" : null }
 C   | { "val1" : 5, "val2" : null, "val3" : null }
 D   | { "val1" : 6, "val2" : 7, "val3" : 8 }

For a final query consumer that understands JSON, there are no drawbacks. The only remaining one is that the result cannot be consumed as a table source.
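
If a table source is needed after all, the keys can be expanded back into columns, at the price of spelling out the column list once more. A hedged sketch using json_to_record() (PostgreSQL 9.4+); missing keys simply come back as null:

select j.bar, d.val1, d.val2, d.val3
from (
    select row_name as bar, json_object_agg(attrib, val) as data
    from tbl
    group by row_name
) j
cross join lateral json_to_record(j.data) as d(val1 int, val2 int, val3 int)
order by j.bar;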

Opponent answered 4/2, 2017 at 14:46 Comment(2)
Is there a way to transform the JSON data to a table with columns?Manager
@alearnerhasnoname, there is indeed a way to transform the JSON into a table. I'm working on a Q&A entry for JSON crosstab-pivots using the json_populate_recordset() method. Here's a dbfiddle: dbfiddle.uk/Sn7iO4zLSneed

In your case I guess an array is good (SQL Fiddle):

select
    bar,
    feh || array_fill(null::int, array[c - array_length(feh, 1)]) feh
from
    (
        select bar, array_agg(feh order by feh) feh
        from foo
        group by bar
    ) s
    cross join (
        select count(*)::int c
        from foo
        group by bar
        order by c desc limit 1
    ) c(c)
;
 bar |      feh      
-----+---------------
 A   | {10,20,NULL}
 B   | {3,4,NULL}
 C   | {5,NULL,NULL}
 D   | {6,7,8}
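
If individual columns are needed after all, array subscripts get you there. A hedged sketch (out-of-range positions simply yield null):

select bar, feh[1] val1, feh[2] val2, feh[3] val3
from (
    select bar, array_agg(feh order by feh) feh
    from foo
    group by bar
) s
order by bar;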
Opponent answered 20/3, 2013 at 0:51 Comment(0)

I'm sorry about digging up the past, but the "Dynamic Crosstab" solution returns an erroneous result table when the input table has "holes" in the values, e.g. "C" has val1 and val3 but not val2. The valN values are then erroneously "aligned to the left" and do not correspond to the column names: the val3 value ends up in column val2 (i.e. the next free column) of the final table.

CREATE TEMP TABLE tbl (row_name text, attrib text, val int); 
INSERT INTO tbl (row_name, attrib, val) VALUES ('C', 'val1', 5), ('C', 'val3', 7);

SELECT * FROM crosstab('SELECT row_name, attrib, val FROM tbl 
ORDER BY 1,2') AS ct (row_name text, val1 int, val2 int, val3 int);

row_name|val1|val2|val3
 C      |   5|  7 |

In order to put values with "holes" into their correct columns, the crosstab query requires the 2-parameter form: a second query that enumerates the categories to match the column definition list, e.g. crosstab('SELECT row_name, attrib, val FROM tbl ORDER BY 1', $$VALUES ('val1'), ('val2'), ('val3')$$).
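
Applied to the example above (a sketch; the category list must match the column definition list):

SELECT * FROM crosstab(
   'SELECT row_name, attrib, val FROM tbl ORDER BY 1'
 , $$VALUES ('val1'), ('val2'), ('val3')$$
) AS ct (row_name text, val1 int, val2 int, val3 int);

row_name|val1|val2|val3
 C      |   5|    |   7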

Smetana answered 29/5, 2015 at 6:58 Comment(0)

This isn't really dynamic in the sense that you still have to enumerate as many columns as values you anticipate, but it's easy enough to do it this way. The one gotcha is that the columns require an ordinal key to match, with no interruptions. Also, if there are duplicate keys, it'll throw everything out of whack, so de-duplication is required as well. The sets would all have to be pre-partitioned to accommodate proper sets of N.

To me it looks kludgy, so I'm not sure it gets much bang for the buck. But I'm adding it to the community dogpile in the hope it'll give someone else a starting point for a better approach.

/** build a dataset **/
DROP TABLE IF EXISTS tmpT ;
CREATE TEMP TABLE tmpT AS
SELECT
 NULL::INT AS key
 ,NULL::INT AS ints
 ,NULL::VARCHAR(1) AS chars
 ,NULL::VARCHAR(3) AS unnest
LIMIT 0 ;

insert into tmpT (key, ints, chars, unnest)
values   (1 , 1   , 'o',  CHR( 130 - 10 ) )       
        ,(2 , 2   , 'n',  CHR( 130 - 11 ) )       
        ,(3 , 3   , 'm',            NULL  )       
        --,(4 , 4   , 'l',  CHR( 130 - 13 ) ) -- missing set       
        ,(5 , 5   , null, CHR( 130 - 14 ) )        
        ,(6 , null, 'j',  CHR( 130 - 15 ) )        
        ,(7 , 7   , null, CHR( 130 - 16 ) )         
        ,(8 , null, 'h',  CHR( 130 - 17 ) )        
        ,(9 , 9   , null, CHR( 130 - 18 ) )         
        ,(10, null, 'f' ,           NULL  )        
        ,(11, null, 'a',  CHR( 130 - 20 ) )        
        ,(12, 12  , null, CHR( 130 - 21 ) )         
 ; /** end of build a dataset **/

/** set up full set of pivotal column positions, to backfill any missing  **/
DROP TABLE IF EXISTS tGenSer ; 
CREATE TEMP TABLE tGenSer AS SELECT generate_series( 1, 1000 )::INT AS key ;

/** THEN THE PIVOT **/

/* Pivot 10 columns */
SELECT *
FROM     /* name the columns*/
(    SELECT a a ,a b ,a c ,a d ,a e ,a f ,a g ,a h ,a i ,a j /*,a k ,a l ,a m ,a n ,a o ,a p ,a q ,a r ,a s ,a t*/ /* ,a u ,a v ,a w ,a x ,a y ,a z*/ FROM ( SELECT NULL::VARCHAR(3) AS a /**seed the typed columns **/) a  
    
    UNION /** union is just a helper, to assign names to unnamed columns **/
    
    /** 20 columns **/
    SELECT * FROM
    (
        /* enumerate columns, no name */
        SELECT t1.x[1 ] ,t1.x[2 ] ,t1.x[3 ] ,t1.x[4 ] ,t1.x[5 ] ,t1.x[6 ] ,t1.x[7 ] ,t1.x[8 ] ,t1.x[9 ] ,t1.x[10] 
        FROM ( SELECT  ARRAY( SELECT  a.ints::TEXT  AS v   
                        FROM tGenSer tg /**backfill missing keys**/ 
                        LEFT JOIN tmpT a ON tg.key = a.key ORDER BY tg.key 
                        ) AS x 
             ) t1
        
        UNION ALL
        
        SELECT t1.x[1 ] ,t1.x[2 ] ,t1.x[3 ] ,t1.x[4 ] ,t1.x[5 ] ,t1.x[6 ] ,t1.x[7 ] ,t1.x[8 ] ,t1.x[9 ] ,t1.x[10] 
        FROM ( SELECT  ARRAY( SELECT  a.chars::TEXT AS v 
                        FROM tGenSer tg /**backfill missing keys**/ 
                        LEFT JOIN tmpT a ON tg.key = a.key ORDER BY tg.key 
                        ) AS x 
            ) t1
        
        UNION ALL
        
        SELECT t1.x[1 ] ,t1.x[2 ] ,t1.x[3 ] ,t1.x[4 ] ,t1.x[5 ] ,t1.x[6 ] ,t1.x[7 ] ,t1.x[8 ] ,t1.x[9 ] ,t1.x[10] 
        FROM ( SELECT  ARRAY( SELECT  a.unnest      AS v 
                        FROM tGenSer tg /**backfill missing keys**/ 
                        LEFT JOIN tmpT a ON tg.key = a.key 
                        ORDER BY tg.key 
                        ) AS x 
           ) t1
     ) a
)b
WHERE ( a,b,c,d,e,f,g,h,i,j) IS DISTINCT FROM ( NULL ,NULL ,NULL ,NULL ,NULL ,NULL ,NULL ,NULL ,NULL ,NULL )        

;
    

RESULTS:

+---+---+--+--+--+--+--+--+--+--+
| a | b |c |d |e |f |g |h |i |j |
+---+---+--+--+--+--+--+--+--+--+
| x | w |  |  |t |s |r |q |p |  |
| o | n |m |  |  |j |  |h |  |f |
| 1 | 2 |3 |  |5 |  |7 |  |9 |  |
+---+---+--+--+--+--+--+--+--+--+
Armond answered 20/5, 2021 at 0:57 Comment(0)
