Efficiently finding duplicates in a list
I have the function below, which searches an array for duplicate entries and returns a list of the duplicates. I'd like to speed up this piece of my code; can anyone suggest a more efficient approach?

Code:

def findDupe(array):
    dupelist = []
    for i in range(len(array)):
        for j in range(len(array)):
            comp1 = array[i]
            comp2 = array[j]
            if comp1 == comp2 and i!=j:
                if comp2 not in dupelist:
                    dupelist.append(comp2)
    return dupelist
Suburb answered 3/10, 2017 at 23:22 Comment(0)

The idea here is to do a single sweep in linear time. You can use a Counter to do this: count each element, then return all those which were counted more than once.

from collections import Counter

def get_duplicates(array):
    c = Counter(array)
    return [k for k in c if c[k] > 1] 

The approach above is linear in complexity, but makes two passes over the input: one to count (abstracted away by the Counter constructor), and another to return the non-unique values in the list comprehension. You can optimise this a bit using a yield statement, returning duplicates as you find them.

def get_duplicates(array):
    c = Counter()
    seen = set()
    for i in array: 
        c[i] += 1
        if c[i] > 1 and i not in seen:
            seen.add(i)
            yield i

This costs a compulsory if check and extra space in the form of a set, but it reduces two passes to one.
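For instance, converting the generator's output to a list (the function is repeated here so the snippet runs standalone):

```python
from collections import Counter

def get_duplicates(array):
    # Same single-pass generator as above, repeated for a self-contained example.
    c = Counter()
    seen = set()
    for i in array:
        c[i] += 1
        if c[i] > 1 and i not in seen:
            seen.add(i)
            yield i

print(list(get_duplicates(['a', 'b', 'c', 'd', 'e', 'd'])))  # ['d']
```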

Weather answered 3/10, 2017 at 23:26 Comment(9)
@COLDSPEED Thank you for getting back to me so quickly! So if I understand correctly the run time for your final suggestion would be a linear function of the input array. Is that correct? How does that differ from the run time of the original function I provided? – Suburb
@Suburb You have a loop within a loop. This means that your function is quadratic, which is much much slower than linear. – Weather
@COLDSPEED Hi, I tried printing the results of your suggestion with testArray = ['a','b','c','d','e','d'] using print get_duplicates(testArray) and I'm getting the following message <generator object get_duplicates at 0x107609eb0>. If I want to print the results, what do I need to do? I'm not very familiar with generators. – Suburb
@Suburb Yes, yield returns a generator. Convert to a list using: x = list(get_duplicates(...)) – Weather
@COLDSPEED Thank you, much quicker than the previous version! – Suburb
@Suburb Yes! That's the power of time complexity analysis. – Weather
@COLDSPEED I added the results of using %timeit to the original post. It looks like if I add list() I'm not getting a gain in run time. Am I not using %timeit correctly? Or is there something else I'm missing? – Suburb
@Suburb Well, the thing with time complexity is that you usually don't see a difference unless your input size increases. For example, if I replace your example with a list of size 1000, your method takes 200ms, while mine takes 1ms. For data of size 1 million, your function will take hours to complete, while mine takes one and a half seconds :) – Weather
[k for k in c if c[k] > 1] or even [k for k, v in c.items() if 1 < v]. – Slingshot

What is the type of elements in your lists?

  1. Storing elements in a set, as explained above, gives you average complexity Θ(n), but requires the elements to be hashable (a set in Python is implemented as a hash table; see https://wiki.python.org/moin/TimeComplexity).
  2. If you have a comparison function, you can sort the list in worst case Θ(n log n), then compare each element of the list to the next.
  3. If you have a comparison function, you can also implement a set with a (balanced) BST and proceed as in 1 to achieve Θ(n log n) even in the worst case.
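Option 2 can be sketched as follows (duplicates_by_sorting is a name introduced here for illustration; it assumes the elements are orderable, not necessarily hashable):

```python
def duplicates_by_sorting(array):
    # Sort first: equal values end up adjacent, so one linear scan
    # over the sorted list finds every duplicate. Θ(n log n) overall.
    s = sorted(array)
    dups = []
    for i in range(1, len(s)):
        # A value equal to its predecessor is a duplicate; the second
        # check avoids reporting the same value more than once.
        if s[i] == s[i - 1] and (not dups or dups[-1] != s[i]):
            dups.append(s[i])
    return dups

print(duplicates_by_sorting([3, 1, 2, 3, 1, 1]))  # [1, 3]
```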
Chuck answered 5/10, 2017 at 7:35 Comment(0)

While numpy does not have the duplicated method (yet), pandas does:

import pandas as pd

df = pd.Series(input_list)

duplicated_values = df[df.duplicated()].to_list()
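One caveat worth noting: Series.duplicated() flags every occurrence after the first, so a value appearing three times shows up twice in the result; chain drop_duplicates() if you want each duplicate reported once. A quick sketch:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a'])
# duplicated() marks all occurrences after the first:
print(s[s.duplicated()].to_list())                    # ['a', 'a']
# drop_duplicates() collapses the flagged values to one each:
print(s[s.duplicated()].drop_duplicates().to_list())  # ['a']
```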
Keen answered 3/12, 2023 at 16:17 Comment(0)

Alternative with no imports required:

dups = set()
distinct_ids = set()
for an_id in ids:
    if an_id not in distinct_ids:
        distinct_ids.add(an_id)
    else:
        dups.add(an_id)
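Wrapped into a function so it can be called on any list (find_dups is a name introduced here for illustration):

```python
def find_dups(ids):
    dups = set()
    distinct_ids = set()
    for an_id in ids:
        if an_id not in distinct_ids:
            distinct_ids.add(an_id)  # first sighting of this value
        else:
            dups.add(an_id)          # already seen: it's a duplicate
    return dups

print(find_dups([1, 2, 2, 3, 3, 3]))  # {2, 3}
```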
Syneresis answered 15/12, 2023 at 17:12 Comment(0)
