Postgres - select non-blank non-null values from multiple ordered rows
Asked Answered
U

4

7

There are lots of data coming from multiple sources that I need to group based on priority, but the data quality from those sources is different - they may be missing some data. The task is to group that data into a separate table, in as complete as possible way.

For example:

create table grouped_data (
  id serial primary key,
  type text,
  a text,
  b text,
  c int
);

create table raw_data (
  id serial primary key,
  type text,
  a text,
  b text,
  c int,
  priority int
);


insert into raw_data
(type, a,       b,         c,   priority)
values
('one', null,    '',        123, 1),
('one', 'foo',   '',        456, 2),
('one', 'bar',   'baz',     789, 3),
('two', null,    'two-b',   11,  3),
('two', '',      '',        33,  2),
('two', null,    'two-bbb', 22,  1);

Now I need to group records by type, order by priority, take the first non-null and non-empty value, and put it into grouped_data. In this case, value of a for group one would be foo because the row that holds that value have a higher priority than the one with bar. And c should be 123, as it has the highest prio. Same for group two, for each column we take the data that is non-null, non-empty, and has the highest priority, or fallback to null if no actual data present.

In the end, grouped_data is expected to have the following content:

('one', 'foo', 'baz',     123),
('two', null,  'two-bbb', 22)

I've tried grouping, sub-selects, MERGE, cross joins... Alas, my knowledge of PostgreSQL is not good enough to get it working. One thing I'd like to avoid, too - is going through columns one-by-one, since in the real world there are few dozens of columns to work with...

A link to a fiddle I've been using to mess around with this: http://sqlfiddle.com/#!17/76699/1


UPD:

Thank you all! Oleksii Tambovtsev's solution is the fastest one. On a set of data closely resembling a real-world case (2m records, ~30 fields) it takes only 20 seconds to produce the exact same set of data, which was previously generated programmatically and took over 20 minutes.

eshirvana's solution does the same in 95s, Steve Kass' in 125s, and Stefanov.sm - 308s (which is still helluvalotfaster than programatically!)

Thank you all :)

Unstrap answered 22/12, 2021 at 18:32 Comment(0)
A
6

You should try this:

SELECT
       type,
       (array_agg(a ORDER BY priority ASC) FILTER (WHERE a IS NOT NULL AND a != ''))[1] as a,
       (array_agg(b ORDER BY priority ASC) FILTER (WHERE b IS NOT NULL AND b != ''))[1] as b,
       (array_agg(c ORDER BY priority ASC) FILTER (WHERE c IS NOT NULL))[1] as c
FROM raw_data GROUP BY type ORDER BY type;
Averir answered 22/12, 2021 at 18:59 Comment(1)
and what if there's a need to merge values of array type columns? Let's say there's now also a column d added with a type "array of strings" and default value of "'{}'::character varying[]". How would I merge that?Unstrap
P
4

you can use window function first_value:

select distinct 
    type 
  , first_value(a) over (partition by type order by nullif(a,'') is null, priority) as a
  , first_value(b) over (partition by type order by nullif(b,'') is null, priority)  as b
  , first_value(c) over (partition by type order by priority) as c
from raw_data 
Parishioner answered 22/12, 2021 at 18:58 Comment(0)
A
1
select distinct on (type) type, 
  first_value(a) over (partition by type order by (nullif(a, '') is null), priority) a, 
  first_value(b) over (partition by type order by (nullif(b, '') is null), priority) b, 
  first_value(c) over (partition by type order by (c is null), priority) c
from raw_data;
Arrington answered 22/12, 2021 at 19:4 Comment(0)
A
1

This should also work.

WITH types(type) AS (
  SELECT DISTINCT
    type
  FROM raw_data
)
SELECT
  type,
  (SELECT a FROM raw_data WHERE a > '' AND raw_data.type = types.type ORDER BY priority LIMIT 1) AS a,
  (SELECT b FROM raw_data WHERE b > '' AND raw_data.type = types.type ORDER BY priority LIMIT 1) AS b,
  (SELECT c FROM raw_data WHERE c IS NOT NULL AND raw_data.type = types.type ORDER BY priority LIMIT 1) AS c
FROM types
ORDER BY type;
Actinochemistry answered 22/12, 2021 at 19:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.