How to do count(distinct) for multiple columns
Asked Answered
B

6

5

This does not work:

select count(distinct colA, colB) from mytable

I know I can simply solve this by making a double select.

select count(*) from (
    select distinct colA, colB from mytable
)

Is there anyway I can do this without having to do the sub-select?

Boswall answered 4/11, 2021 at 8:39 Comment(2)
Keep the subquery solution.Concupiscent
What is wrong with a subquery?Retiarius
B
1

[TL;DR] Just use a sub-query.


If you are trying to use concatenation then you need to ensure that you delimit the terms with a string that is never going to appear in the values otherwise you will find non-distinct terms grouped together.

For example: if you have a two numeric column then using COUNT(DISTINCT col1 || col2) will group together 1||23 and 12||3 and count them as one group.

You could use COUNT(DISTINCT col1 || '-' || col2) but if the columns are string values and you have 'ab-'||'-'||'c' and 'ab'||'-'||'-c' then, once again, they would be identical once concatenated.

The simplest method is to use a sub-query.

If you can't do that then you can combine columns via string-concatenation but you need to analyse the contents of the column and pick a delimiter that does not appear in your strings otherwise your results might be erroneous. Even better is to ensure that the delimiter character will never be in the sub-string with check constraints.

ALTER TABLE mytable ADD CONSTRAINT mytable__col1__chk CHECK (col1 NOT LIKE '%¬%');
ALTER TABLE mytable ADD CONSTRAINT mytable__col2__chk CHECK (col2 NOT LIKE '%¬%');

Then:

SELECT COUNT(DISTINCT col1 || '¬' || col2)
FROM   mytable;
Blake answered 4/11, 2021 at 8:59 Comment(1)
Also, I suppose the subquery performance will be much better than the concatenation.Concupiscent
C
2

Subquery is standard solution which I recommend too. Concatenation-based solutions, except they are error-prone if dangerous character occurs, might be also worse in performance.

Note: in case you collected obscure solutions how to avoid subquery, window function usage is also possible here (Not to be used in production - your code reviewers won't praise you for it):

select distinct count(*) over ()
from my_table
group by colA, colB
Crazyweed answered 4/11, 2021 at 9:10 Comment(0)
T
2

Just for fun, you can (ab)use window functions and limit clause. These are evaluated after grouping. So:

SELECT COUNT(*) OVER()
FROM t
GROUP BY col_a, col_b
OFFSET 0 ROWS FETCH NEXT 1 ROWS ONLY
Toxicology answered 4/11, 2021 at 9:25 Comment(0)
T
1

If you're trying to avoid sub-selects at all costs, one variant would be to concatenate them as such:

SELECT count(DISTINCT concat(colA, colB)) FROM mytable;
Thou answered 4/11, 2021 at 8:42 Comment(1)
Will count ('AB', 'C') and ('A', 'BC') as one distinct combination.Concupiscent
B
1

[TL;DR] Just use a sub-query.


If you are trying to use concatenation then you need to ensure that you delimit the terms with a string that is never going to appear in the values otherwise you will find non-distinct terms grouped together.

For example: if you have a two numeric column then using COUNT(DISTINCT col1 || col2) will group together 1||23 and 12||3 and count them as one group.

You could use COUNT(DISTINCT col1 || '-' || col2) but if the columns are string values and you have 'ab-'||'-'||'c' and 'ab'||'-'||'-c' then, once again, they would be identical once concatenated.

The simplest method is to use a sub-query.

If you can't do that then you can combine columns via string-concatenation but you need to analyse the contents of the column and pick a delimiter that does not appear in your strings otherwise your results might be erroneous. Even better is to ensure that the delimiter character will never be in the sub-string with check constraints.

ALTER TABLE mytable ADD CONSTRAINT mytable__col1__chk CHECK (col1 NOT LIKE '%¬%');
ALTER TABLE mytable ADD CONSTRAINT mytable__col2__chk CHECK (col2 NOT LIKE '%¬%');

Then:

SELECT COUNT(DISTINCT col1 || '¬' || col2)
FROM   mytable;
Blake answered 4/11, 2021 at 8:59 Comment(1)
Also, I suppose the subquery performance will be much better than the concatenation.Concupiscent
M
0

Concatenate them.

Select count(distinct colA ||'-'|| colB) from mytable;
Michelinamicheline answered 4/11, 2021 at 8:41 Comment(5)
well I know the Concatenate method, so now I am ashames I did not think of it myself :D Thank!Boswall
Will count ('AB', 'C') and ('A', 'BC') as one distinct combination.Concupiscent
Now it won't; thank you, @jarlh.Michelinamicheline
Now ('A-', 'B') and ('A', '-B') are both counted as ('A--B').Concupiscent
Oh well, if there's really such a data that nothing of suggested simple options helps, then subquery it is.Michelinamicheline
Q
0

NB: The answer is for SQLite

There's another option using JSON:

SELECT count(DISTINCT json_object('colA', colA, 'colB', colB))
FROM mytable;
Quaternary answered 3/9, 2023 at 23:21 Comment(2)
In PostgreSQL you could simply use row values for this, e.g., SELECT count(DISTINCT (colA, colB)) FROM mytable;, but SQLite limits the usage of row values for some reason -- sqlite.org/rowvalue.html. json_object() can be used as a substitution where you'd normally use a row value.Quaternary
Sorry, I didn't notice the tags for this question -- I thought it was for SQLite as I've been searching for SQLite options.Quaternary

© 2022 - 2024 — McMap. All rights reserved.