I am using Snowflake database and ran this query to find total count, number of distinct records and difference:
select
(select count(*) from mytable) as total_count,
(select count(*) from (select distinct * from mytable)) as distinct_count,
(select count(*) from mytable) - (select count(*) from (select distinct * from mytable)) as duplicate_count
from mytable limit 1;
Result:
1,759,867
1,738,924
20,943 (duplicate_count)
But when try with the other approach (group ALL columns and find where count is > 1):
select count(*) from (
SELECT
a, b, c, d, e,
COUNT(*)
FROM
mytable
GROUP BY
a, b, c, d, e
HAVING
COUNT(*) > 1
)
I get 5,436
.
Why there is a difference in number of duplicates? (20,943
vs 5,436
)
Thanks.