How to find duplicates (correct way)?

Asked 21/5, 2019 at 22:37 Answered 23/9, 2024 at 20:33

Solved sql duplicates snowflake-cloud-data-platform

I am using Snowflake database and ran this query to find total count, number of distinct records and difference:

select 
    (select count(*) from mytable) as total_count, 
    (select count(*) from (select distinct * from mytable)) as distinct_count,
    (select count(*) from mytable) - (select count(*) from (select distinct * from mytable)) as duplicate_count
from mytable limit 1;

Result:

1,759,867
1,738,924
20,943 (duplicate_count)

But when try with the other approach (group ALL columns and find where count is > 1):

select count(*) from (
SELECT 
    a, b, c, d, e,
    COUNT(*)
FROM 
    mytable
GROUP BY 
    a, b, c, d, e
HAVING 
    COUNT(*) > 1
)

I get 5,436.

Why there is a difference in number of duplicates? (20,943 vs 5,436)

Thanks.

Lipscomb answered 21/5, 2019 at 22:37 Comment(0)

Okay. Let's start from one simple example:

create table #test
(a int, b int, c int, d int, e int)

insert into #test values (1,2,3,4,5)
insert into #test values (1,2,3,4,5)
insert into #test values (1,2,3,4,5)
insert into #test values (1,2,3,4,5)
insert into #test values (1,2,3,4,5)
insert into #test values (5,4,3,2,1)
insert into #test values (5,4,3,2,1)
insert into #test values (1,1,1,1,1)

And try your subquery to understand what you will get:

SELECT 
    a, b, c, d, e,
    COUNT(*)
FROM 
    #test
GROUP BY 
    a, b, c, d, e
HAVING 
    COUNT(*) > 1

Think about a while...

Dang Dang Dang Dang ~

a   b   c   d   e   (No column name)
1   2   3   4   5   5
5   4   3   2   1   2

It will only return two rows because you used 'group by'. But it still counted the duplicate numbers for each a,b,c,d,e combinations.

If you want the total number of the duplicates, try this:

select sum(sub_count) from (
SELECT 
    a, b, c, d, e,
    COUNT(*) - 1 as sub_count
FROM 
    #test
GROUP BY 
    a, b, c, d, e
HAVING 
    COUNT(*) > 1)a

You need to minus one in this case if I understand your original queries correctly. Correct me if I am wrong.

Supervise answered 21/5, 2019 at 23:7 Comment(3)

Great, thanks.. It make sense since for each group at first I assigned only 1 count (but instead it can be more than 1 of the same duplicate).. Now the number match.. – Lipscomb 21/5, 2019 at 23:21

Is there a way to delete them from a table without using a 'temp' table? – Lipscomb 21/5, 2019 at 23:29

@Lipscomb 1, if you are asking how to delete duplicates rows. Come here: #18391074. 2, I used temp table for testing purpose. You can delete without a temp table. – Supervise 21/5, 2019 at 23:34

Because 5436 is the actual distinct records that are repeated (duplicated). Each one is duplicated any number of times more than once and possibly different times. If you take into account the total of ALL the rows that with duplicates then the count will be 20,943.

Arteriosclerosis answered 23/9, 2024 at 20:33 Comment(0)

Recommended topics

Hot tags