SQL aggregation function to choose the only value
I have a rowset with two columns: technical_id and natural_id. The rowset is actually the result of a complex query. The mapping between the column values is assumed to be bijective (i.e. two rows with the same technical_id have the same natural_id, and distinct technical_ids have distinct natural_ids). The (technical_id, natural_id) pairs are not unique in the rowset because of joins in the original query. Example:

with t (technical_id, natural_id, val) as (values
  (1, 'a', 1),
  (1, 'a', 2),
  (2, 'b', 3),
  (2, 'b', 2),
  (3, 'c', 0),
  (3, 'c', 1),
  (4, 'd', 1)
)

Unfortunately, the bijection is enforced only by application logic. The natural_id is actually collected from multiple tables and composed using a coalesce-based expression, so its uniqueness can hardly be enforced by a database constraint.

I need to aggregate the rows of the rowset by technical_id, assuming the natural_id is unique per technical_id. If it isn't (for example, if the tuple (4, 'x', 1) were added to the sample data), the query should fail. In an ideal SQL world I would use some hypothetical aggregate function:

select technical_id, only(natural_id), sum(val)
from t
group by technical_id;

I know there is no such function in SQL. Is there some alternative or workaround? Postgres-specific solutions are also fine.

Note that group by technical_id, natural_id and select technical_id, max(natural_id) - though both work in the happy case - are unacceptable: the first because technical_id must be unique in the result under all circumstances, the second because the picked value is essentially arbitrary and masks the data inconsistency.
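For illustration, here is a minimal sketch of how the max-based variant masks the problem when the inconsistent row is present:

```sql
-- Sketch: with the inconsistent row (4, 'x', 1) added,
-- max() silently picks one of the two natural_ids instead of failing.
with t (technical_id, natural_id, val) as (values
  (4, 'd', 1),
  (4, 'x', 1)
)
select technical_id, max(natural_id), sum(val)
from t
group by technical_id;
-- returns (4, 'x', 2): the inconsistency goes unnoticed
```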

Thanks for tips :-)

UPDATE: the expected answer is

technical_id,v,sum
1,a,3
2,b,5
3,c,1
4,d,1

or fail when 4,x,1 is also present.

Adamina answered 23/1, 2020 at 15:19 Comment(2)
Can you expand the question by showing expected results from your example data? - Ileus
I'm hesitant to answer - but it feels like you might be looking for a HAVING COUNT() clause - Ileus

It seems I've finally found a solution, based on the single-row cardinality of a correlated subquery in the select clause:

select technical_id,
       (select v from unnest(array_agg(distinct natural_id)) as u(v)) as natural_id,
       sum(val)
from t
group by technical_id;
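To sanity-check the failure mode: if the bijection is broken, array_agg(distinct ...) collects more than one value, the scalar subquery returns more than one row, and Postgres aborts the query with SQLSTATE 21000:

```sql
-- With the inconsistent row (4, 'x', 1), array_agg(distinct ...) yields
-- {'d','x'}, so the scalar subquery returns two rows and the query fails:
-- ERROR: more than one row returned by a subquery used as an expression
with t (technical_id, natural_id, val) as (values
  (4, 'd', 1),
  (4, 'x', 1)
)
select technical_id,
       (select v from unnest(array_agg(distinct natural_id)) as u(v)) as natural_id,
       sum(val)
from t
group by technical_id;
```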

This is the simplest solution for my situation at the moment, so I'll self-accept. If any disadvantages show up, I will describe them here and accept another answer instead. I appreciate all the other proposals and believe they will be valuable to others too.

Adamina answered 24/1, 2020 at 9:6 Comment(0)

You can get only the "unique" natural ids using:

select technical_id, max(natural_id), sum(val)
from t
group by technical_id
having min(natural_id) = max(natural_id);

If you want the query to actually fail, that is a little hard to guarantee. Here is a hacky way to do it:

select technical_id, max(natural_id), sum(val)
from t
group by technical_id
having (case when min(natural_id) = max(natural_id) then 0 else 1 / (count(*) - count(*)) end) = 0;

And a db<>fiddle illustrating this.

Spermatid answered 23/1, 2020 at 15:44 Comment(1)
Thanks, Gordon, yes, the min=max solution would exclude technical_ids from the result, which I don't want either. But the division-by-zero hack is really cool! For now I'm upvoting it, and I'll think it over until morning whether I dare to put such a hack into a pull request, with respect to code comprehensibility. - Spectrophotometer

You can use

SELECT technical_id, max(natural_id), count(natural_id)
...
GROUP BY technical_id;

and throw an error whenever the count is not 1.
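Applied to the sample CTE, that check might look like the following sketch. Note the use of count(distinct natural_id) rather than count(natural_id): duplicate pairs are expected in the rowset, so only the number of distinct natural_ids per group signals an inconsistency:

```sql
-- Sketch: report how many distinct natural_ids each group has;
-- the calling code then raises an error whenever ids <> 1.
select technical_id,
       max(natural_id) as natural_id,
       count(distinct natural_id) as ids,
       sum(val)
from t
group by technical_id;
```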

If you want to guarantee the constraint with the database, you could do one of these:

  1. Do away with the artificial primary key.

  2. Do something complicated like this:

    CREATE TABLE id_map (
       technical_id bigint UNIQUE NOT NULL,
       natural_id text UNIQUE NOT NULL,
       PRIMARY KEY (technical_id, natural_id)
    );
    
    ALTER TABLE t
       ADD FOREIGN KEY (technical_id, natural_id) REFERENCES id_map;
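Assuming t were a real table with those two columns, the composite foreign key would reject an inconsistent pair outright (a hypothetical session):

```sql
-- id_map holds the canonical bijection between the two ids
insert into id_map values (4, 'd');
insert into t values (4, 'd', 1);  -- ok: the pair (4, 'd') exists in id_map
insert into t values (4, 'x', 1);  -- fails with a foreign key violation:
                                   -- (4, 'x') is not present in id_map
```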
    
Hagai answered 23/1, 2020 at 15:32 Comment(7)
Thanks, Laurenz, I agree the application-based solution of guarding uniqueness is still in play; I was just interested in whether there is some pure SQL solution. The t CTE is actually not a table, so I cannot reference it from any artificial table, leaving aside that it would burden the application with maintenance of another table id_map. - Spectrophotometer
Then my first suggestion should solve the problem, right? The alternative would be to suppress such results in the output. - Hagai
I am sorry, I would prefer a solution where I don't have to change the database structure. - Spectrophotometer
I mean the SELECT with the count. That doesn't require you to change anything in the database. - Hagai
I see, my misunderstanding (I referred to your point 1). The select with the count would require postprocessing the result with some additional logic. The query is part of an ETL pipeline which I am not sure I want to change now that two different pure-SQL solutions have appeared. - Spectrophotometer
Ok, then what is your requirement? Ignore such data? There are only three choices: 1) prevent such data from happening (you don't want that) 2) report such data (you don't want that either) 3) ignore such data. - Hagai
4) fail fast when such data occur. To point 1 (prevention): I expect the inconsistency won't happen in practice, since the application does its best. To point 2 (reporting): I don't want to add more than the necessary amount of code for sanitizing such a corner case. Ignoring (point 3) is not an option. If the inconsistency arises anyway, the pipeline simply fails and the cause is relatively easily recognized from PG error 21000. In any case, I thank you very much for your points. - Spectrophotometer

You can create your own aggregate. ONLY is a keyword, so it's best not to use it as the name of an aggregate. Not willing to put much time into deciding, I called it only2.

CREATE OR REPLACE FUNCTION public.only_agg(anyelement, anyelement)
 RETURNS anyelement
 LANGUAGE plpgsql
 IMMUTABLE
AS $function$
BEGIN
  -- NULLs are ignored: keep whichever argument is non-null
  if $1 is null then return $2; end if;
  if $2 is null then return $1; end if;
  -- equal values collapse to one
  if $1 = $2 then return $1; end if;
  -- two distinct non-null values: the "only one value" assumption is violated
  raise exception 'not only';
END $function$;

create aggregate only2 (anyelement) (sfunc = only_agg, stype = anyelement);

It might not do the thing you want with NULL inputs, but I don't know what you want in that case.
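Used against the sample data from the question, the aggregate slots in exactly where the hypothetical only() was wanted; with a conflicting row such as (4, 'x', 1) present, the query raises the 'not only' exception instead of returning a row:

```sql
select technical_id, only2(natural_id), sum(val)
from t
group by technical_id;
```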

Oxidate answered 24/1, 2020 at 1:52 Comment(6)
Thanks, jjanes, this is definitely an attractive solution, I will try it. - Spectrophotometer
Why not a language SQL function that does a simple select coalesce($1, $2);? It will be a lot faster than a PL/pgSQL function (but you can't make it "fail" with an exception if both are null). - Hypallage
@a_horse_with_no_name That was my first attempt, but you can't throw a real exception (that I know of) from SQL. It is hard to even divide by zero, because constant folding will make that happen at the wrong time. - Oxidate
@jjanes: why would you want to throw an exception? - Hypallage
That was part of the problem specification: "or fail when 4,x,1 is also present." Otherwise I'd just use min, or max, or make one named "first". - Oxidate
@a_horse_with_no_name - throwing an exception was my requirement, since the query is part of an ETL pipeline (I described the context more precisely under Laurenz Albe's answer). jjanes - your aggregate function works perfectly, I just preferred not to maintain another db object where an inline solution is enough. - Spectrophotometer

© 2022 - 2024 — McMap. All rights reserved.