Pythonic way of removing reversed duplicates in list

I have a list of pairs:

[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]

and I want to remove any duplicates where

[a,b] == [b,a]

So we end up with just

[0, 1], [0, 4], [1, 4]

I can do an inner & outer loop checking for the reverse pair and append to a list if that's not the case, but I'm sure there's a more Pythonic way of achieving the same results.
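For reference, the double-loop version I had in mind looks something like this (just a sketch; the inner loop is hidden in the `in` membership test, so it's O(n**2)):

```python
pairs = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
result = []
for pair in pairs:
    # Keep the pair only if its reverse has not been kept already.
    if pair[::-1] not in result:
        result.append(pair)
print(result)  # [[0, 1], [0, 4], [1, 4]]
```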

Ib answered 15/12, 2016 at 12:51 Comment(7)
What about in-order duplicates? I.e. is it fine if you have two [0, 1]?Naturalize
Do you need to preserve the order of the pairs? E.g., would [1, 4], [0, 1], [0, 4] be fine or does it have to be [0, 1], [0, 4], [1, 4]?Businessman
What about identical pairs, i.e. [1, 1] or [2, 2] in the input? Do they need to be preserved as [1, 1] or is it okay if they are converted to [1]?Gere
I hate these questions that seem well posed but have a thousand little unspecified details that affect which things will or won't work. lolBusinessman
Do the elements of the list need to remain lists or can we convert them to tuples? Also does the order within a pair matter? Like can we convert [[3, 4], [4, 3], [2, 5]] to [[3, 4], [5, 2]]?Bangweulu
I would only add pairs [ a, b ] if a <= b. Or swap a and b if b < a before adding the pair. I would probably not use a list as the outer container. A set or dict is probably a better choice here. Does your other code need the [ b, a ] pairs?Ponytail
I get what you mean by [a, b] == [b, a] but I think technically it's list_1 == list_2 or list_1 == list_2[::-1]Tailored

If you need to preserve the order of the elements in the list, you can use the sorted function and a set comprehension with map like this:

lst = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
data = {tuple(item) for item in map(sorted, lst)}
# {(0, 1), (0, 4), (1, 4)}

or simply without map like this:

data = {tuple(sorted(item)) for item in lst}

Another way is to use a frozenset as shown here; however, note that this only works if the elements of each pair are distinct. Like set, a frozenset always contains unique values, so a pair such as [2, 2] would collapse to frozenset({2}), losing data, which may not be what you want.
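For example, here is that caveat in action:

```python
lst = [[0, 1], [1, 0], [2, 2]]
data = set(map(frozenset, lst))
# The reversed duplicate [1, 0] is removed as intended, but the
# [2, 2] pair collapses to frozenset({2}) -- one element is lost.
print(data)
```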

To output a list, you can use list(map(list, result)), where result is the set of tuples. (In Python 3, map returns an iterator, so the outer list call is needed.)

Statolatry answered 15/12, 2016 at 14:52 Comment(3)
I don't know what happened to my previous comment but your approach doesn't preserve the order of the elements (because you use sorted and a set comprehension). For example: [[1, 0], [2, 0], [3, 0]] results in something like {(0, 2), (0, 1), (0, 3)} (order inside the set may vary).Gere
@Gere I would probably do it differently if the OP hadn't said [a,b] == [b,a]Statolatry
He said "I want to remove any duplicates where [a,b] == [b,a]" not that non-duplicates should be reordered. I don't mean that your answer is incorrect but it should be mentioned in the answer!Gere

If you only want to remove reversed pairs and don't want external libraries, you can use a simple generator function (loosely based on the itertools "unique_everseen" recipe):

def remove_reversed_duplicates(iterable):
    # Create a set for already seen elements.
    seen = set()
    for item in iterable:
        # Lists are mutable, so we need tuples for the set operations.
        tup = tuple(item)
        if tup not in seen:
            # Record the tuple in REVERSED order so [b, a] is skipped later.
            seen.add(tup[::-1])
            # If you also want to remove exact duplicates, uncomment the next line:
            # seen.add(tup)
            yield item

>>> a = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
>>> list(remove_reversed_duplicates(a))
[[0, 1], [0, 4], [1, 4]]

The generator function is a pretty fast way to solve this problem because set lookups are really cheap. This approach also keeps the order of your initial list and only removes reversed duplicates, while being faster than most of the alternatives!
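To illustrate the commented-out line: with an exact duplicate in the input, the generator keeps both copies unless seen.add(tup) is also enabled (self-contained copy of the function, for demonstration):

```python
def remove_reversed_duplicates(iterable):
    seen = set()
    for item in iterable:
        tup = tuple(item)
        if tup not in seen:
            # Only the reversed form is recorded, so exact repeats pass through.
            seen.add(tup[::-1])
            yield item

# The exact duplicate [0, 1] is kept twice; [1, 0] is dropped.
print(list(remove_reversed_duplicates([[0, 1], [0, 1], [1, 0]])))
# [[0, 1], [0, 1]]
```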


If you don't mind using an external library and you want to remove all duplicates (reversed and identical), an alternative is iteration_utilities.unique_everseen:

>>> a = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]

>>> from iteration_utilities import unique_everseen

>>> list(unique_everseen(a, key=set))
[[0, 1], [0, 4], [1, 4]]

This checks if any item has the same contents in arbitrary order (hence key=set) as another. In this case it works as expected, but it also removes duplicate [a, b] occurrences instead of only [b, a]. You could also use key=sorted (like the other answers suggest). Used like this, unique_everseen has bad algorithmic complexity, because the result of the key function is not hashable, so the fast set lookup is replaced by a slow linear lookup. To speed this up you need to make the keys hashable, for example by converting them to sorted tuples (like some other answers suggest):

>>> from iteration_utilities import chained
>>> list(unique_everseen(a, key=chained(sorted, tuple)))
[[0, 1], [0, 4], [1, 4]]

chained(sorted, tuple) is simply a faster alternative to lambda x: tuple(sorted(x)).

EDIT: As mentioned by @jpmc26 one could use frozenset instead of normal sets:

>>> list(unique_everseen(a, key=frozenset))
[[0, 1], [0, 4], [1, 4]]

To get an idea about the performance I did some timeit comparisons for the different suggestions:

>>> a = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]

>>> %timeit list(remove_reversed_duplicates(a))
100000 loops, best of 3: 16.1 µs per loop
>>> %timeit list(unique_everseen(a, key=frozenset))
100000 loops, best of 3: 13.6 µs per loop
>>> %timeit list(set(map(frozenset, a)))
100000 loops, best of 3: 7.23 µs per loop

>>> %timeit list(unique_everseen(a, key=set))
10000 loops, best of 3: 26.4 µs per loop
>>> %timeit list(unique_everseen(a, key=chained(sorted, tuple)))
10000 loops, best of 3: 25.8 µs per loop
>>> %timeit [list(tpl) for tpl in list(set([tuple(sorted(pair)) for pair in a]))]
10000 loops, best of 3: 29.8 µs per loop
>>> %timeit set(tuple(item) for item in map(sorted, a))
10000 loops, best of 3: 28.5 µs per loop

Long list with many duplicates:

>>> import random
>>> a = [[random.randint(0, 10), random.randint(0,10)] for _ in range(10000)]

>>> %timeit list(remove_reversed_duplicates(a))
100 loops, best of 3: 12.5 ms per loop
>>> %timeit list(unique_everseen(a, key=frozenset))
100 loops, best of 3: 10 ms per loop
>>> %timeit set(map(frozenset, a))
100 loops, best of 3: 10.4 ms per loop

>>> %timeit list(unique_everseen(a, key=set))
10 loops, best of 3: 47.7 ms per loop
>>> %timeit list(unique_everseen(a, key=chained(sorted, tuple)))
10 loops, best of 3: 22.4 ms per loop
>>> %timeit [list(tpl) for tpl in list(set([tuple(sorted(pair)) for pair in a]))]
10 loops, best of 3: 24 ms per loop
>>> %timeit set(tuple(item) for item in map(sorted, a))
10 loops, best of 3: 35 ms per loop

And with fewer duplicates:

>>> a = [[random.randint(0, 100), random.randint(0,100)] for _ in range(10000)]

>>> %timeit list(remove_reversed_duplicates(a))
100 loops, best of 3: 15.4 ms per loop
>>> %timeit list(unique_everseen(a, key=frozenset))
100 loops, best of 3: 13.1 ms per loop
>>> %timeit set(map(frozenset, a))
100 loops, best of 3: 11.8 ms per loop


>>> %timeit list(unique_everseen(a, key=set))
1 loop, best of 3: 1.96 s per loop
>>> %timeit list(unique_everseen(a, key=chained(sorted, tuple)))
10 loops, best of 3: 24.2 ms per loop
>>> %timeit [list(tpl) for tpl in list(set([tuple(sorted(pair)) for pair in a]))]
10 loops, best of 3: 31.1 ms per loop
>>> %timeit set(tuple(item) for item in map(sorted, a))
10 loops, best of 3: 36.7 ms per loop

So the variants with remove_reversed_duplicates, unique_everseen(key=frozenset) and set(map(frozenset, a)) seem to be by far the fastest solutions. Which one is fastest depends on the length of the input and the number of duplicates.

Gere answered 15/12, 2016 at 14:29 Comment(4)
In Python like anywhere, paying attention to algorithmic complexity is important. It is easy to write compact code which actually nests hidden iterations, while with a few more lines, the extra loops can be removed. So this answer is the correct solution for me and even more because I wasn't aware of the external iteration_utilities lib and had always wondered why itertools didn't include those recipes.Burchfield
@Burchfield For completeness: there are several libraries implementing some "beyond itertools" recipes: toolz, cytoolz, pydash, more-itertools (and probably many more).Gere
@Burchfield You're right. It is important... on large sets of data. Paying attention to readability and trying the simple or built in thing first is often a win, though, since much of our code operates on small enough sets that the differences don't really matter.Businessman
@Businessman or when small details matter, for example the remove_reversed_duplicates or unique_everseen yields the items as is, while all map and set approaches either need to convert them back (potentially losing data and/or order) or can only be applied if the conditions are met. But I agree - if a simple one-liner will do: no need to engage the full-fledged generator-function or an external library. But libraries especially have often already dealt with all these minor annoying issues and fixed them.Gere

TL;DR

set(map(frozenset, lst))

Explanation

If the pairs are logically unordered, they're more naturally expressed as sets. It would be better to have them as sets before you even get to this point, but you can convert them like this:

lst = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
lst_as_sets = map(frozenset, lst)

And then the natural way of eliminating duplicates in an iterable is to convert it to a set:

deduped = set(lst_as_sets)

(This is the main reason I chose frozenset in the first step. Mutable sets are not hashable, so they can't be added to a set.)
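A quick demonstration of why frozenset is needed:

```python
s = set()
s.add(frozenset([0, 1]))   # fine: frozenset is hashable
try:
    s.add({0, 1})          # TypeError: a mutable set is not hashable
    added = True
except TypeError:
    added = False
print(added)  # False
```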

Or you can do it in a single line like in the TL;DR section.

I think this is much simpler, more intuitive, and more closely matches how you think about the data than fussing with sorting and tuples.

Converting back

If for some reason you really need a list of lists as the final result, converting back is trivial:

result_list = list(map(list, deduped))

But it's probably more logical to leave it all as sets as long as possible. I can only think of one reason that you might need this, and that's compatibility with existing code/libraries.

Businessman answered 15/12, 2016 at 20:42 Comment(12)
+1, people (ab)using tuple(sorted(x)) when they really want frozenset(x) is a (small) pet peeve of mine.Font
This approach may randomly shuffle the remaining items because you convert it to a set. However the frozenset suggestion is pretty clever +1Gere
@Gere True, but the OP didn't specify anything about the final order. So I assumed it didn't matter. I've added a comment requesting clarification on that point.Businessman
This will also compromise order within the elements themselves: [[2,1]] -> [[1,2]].Flaggy
@Flaggy Sets are by definition unordered. Any order created is implementation specific and cannot be relied upon.Businessman
Your implementation compromises order of element elements. OP's question doesn't permit that, and that may matter. Does original list contain [1,2]? No. Does that list, with all reverse duplicates removed, contain [1,2]? No seems like a logical answer to me, but your code says "Maybe?". Although we treat reverse entries as identical here, we are NOT safe to assume that they are interchangeable down the road, unless stated otherwise. OP does not say a word about sets and we can't just drop orderings without explicit permission when we get lists as input and give lists as output IMO.Flaggy
@Flaggy The OP states they are removing "duplicates." If the reverse is a "duplicate," they're equivalent and it doesn't matter. If the OP has requirements as off the wall as you describe, they've worded their question badly and need to specify details. Until I see an explicit requirement to preserve the properties you describe, the most reasonable conclusion from the wording is that such requirements don't exist.Businessman
OP's examples show preserving both inner and general order. If code down the road was expecting set of sets, not lists of lists, then the problem wouldn't even be there in the first place. You should probably add a disclaimer that this code only works if OP s fine with [[1,9], [2,2], [5,3]] -> [[2], [9,1], [5,3]], because this assumption is what your answer is based on and this assumption does not automatically follow from the question.Flaggy
"the most reasonable conclusion from the wording is that such requirements don't exist." - I think that is unduly optimistic. Your answer is a good one though for other people who might find this question via search, and not have the same requirements.Middleweight
The fact that [[2,2]] is converted to [[2]] is the most troubling issue. If the inner lists represent links in an undirected graph, then other code may have trouble if the list doesn't have two entries.Middleweight
@Flaggy I think it's unlikely that the OP even has entries that contain the same number twice. If the OP clarifies on the issue and something needs to change, I will change it then. I've already responded to the order issue. If the OP needed this, they should have said so, and if they do say so, I will handle it then. I've listened to your feedback, and that's my decision. No further comments on it are needed or useful.Businessman
@Flaggy Additionally, no, there's no definitive proof that my assertions are correct. There's also no definitive proof that they're incorrect. I think they are the most reasonable, most common sense way to understand what the OP is asking for. (The OP has unordered pairs and shows no example of relating an item to itself.) You're free to disagree, but harassing me because I don't agree with your interpretation is pointless.Businessman

You could sort each pair, convert your list of pairs to a set of tuples and back again:

l = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
[list(tpl) for tpl in list(set([tuple(sorted(pair)) for pair in l]))]
#=> [[0, 1], [1, 4], [0, 4]]

The steps might be easier to understand than a long one-liner:

>>> l = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
>>> [sorted(pair) for pair in l]
# [[0, 1], [0, 4], [0, 1], [1, 4], [0, 4], [1, 4]]
>>> [tuple(pair) for pair in _]
# [(0, 1), (0, 4), (0, 1), (1, 4), (0, 4), (1, 4)]
>>> set(_)
# set([(0, 1), (1, 4), (0, 4)])
>>> list(_)
# [(0, 1), (1, 4), (0, 4)]
>>> [list(tpl) for tpl in _]
# [[0, 1], [1, 4], [0, 4]]
Docket answered 15/12, 2016 at 13:2 Comment(2)
Nice solution, but it depends on whether order matters (because you're using set); you're assuming that the sorted() pair is the one to return, which may not always be trueSolarism
The 3 pairs returned in the example are sorted, so I assumed it was okay. You're right, it could also be the first encountered pair, though.Docket

You could use the builtin filter function.

from __future__ import print_function

def my_filter(l):
    seen = set()

    def not_seen(it):
        s = min(*it), max(*it)
        if s in seen:
            return False
        else:
            seen.add(s)
            return True
        
    out = list(filter(not_seen, l))  # wrap in list() so this also works on Python 3

    return out

myList = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
print(my_filter(myList)) # [[0, 1], [0, 4], [1, 4]]

As a complement, I would point you to the Python itertools module, whose recipes section describes a unique_everseen function that does basically the same thing as above, but in a lazy, generator-based, memory-efficient version. It might be better than any of our solutions if you are working on large arrays. Here is how to use it:

from itertools import filterfalse  # called ifilterfalse in Python 2

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

gen = unique_everseen(myList, lambda x: (min(x), max(x))) # gen is an iterator
print(gen) # <generator object unique_everseen at 0x7f82af492fa0>
result = list(gen) # consume generator into a list.
print(result) # [[0, 1], [0, 4], [1, 4]]

I haven't done any metrics to see which is fastest. However, memory efficiency and algorithmic complexity seem better in this version.

Timing min/max vs sorted

The built-in sorted function could be passed to unique_everseen to order the items within the inner vectors. Instead, I pass lambda x: (min(x), max(x)). Since I know the vector size is exactly 2, I can proceed like this.

To use sorted I would need to pass lambda x: tuple(sorted(x)), which adds overhead. Not dramatically, but still.

myList = [[random.randint(0, 10), random.randint(0,10)] for _ in range(10000)]
timeit.timeit("list(unique_everseen(myList, lambda x: (min(x), max(x))))", globals=globals(), number=20000)
>>> 156.81979029000013
timeit.timeit("list(unique_everseen(myList, lambda x: tuple(sorted(x))))", globals=globals(), number=20000)
>>> 168.8286430349999

Timings done in Python 3, which adds the globals kwarg to timeit.timeit.

Burchfield answered 15/12, 2016 at 14:49 Comment(3)
I like the unique_everseen() solution. Unique everseen is a very useful tool. One difference with other solutions is that it preserves the order of first appearance. Also the question was about a pythonic solution, and unique_everseen() is definitely pythonic!Malena
You can replace lambda x: (min(x), max(x)) with sorted in this case.Flaggy
I can't pass sorted straight in, I need lambda x: tuple(sorted(x)): unique_everseen needs the "inner vectors" to be hashable. Using sorted adds overhead (see edit in message). Though if the inner vector size were unknown, sorted would be needed.Burchfield

An easy and unnested solution:

pairs = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
s=set()
for p in pairs:
    # Lists are unhashable so make the "elements" into tuples
    p = tuple(p)
    if p not in s and p[::-1] not in s:
        s.add(p)

print(s)
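Note that s ends up as a set of tuples; if you need the original list-of-lists shape back, you can convert it, for example:

```python
pairs = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
s = set()
for p in pairs:
    # Lists are unhashable, so make the pairs into tuples.
    p = tuple(p)
    if p not in s and p[::-1] not in s:
        s.add(p)

# Convert the set of tuples back into a list of lists.
result = [list(t) for t in s]
print(sorted(result))  # [[0, 1], [0, 4], [1, 4]]
```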
Ukase answered 15/12, 2016 at 14:46 Comment(0)

EDITED to better explain

First sort each list, then use dictionary keys to get a unique set of elements, and finally a list comprehension to convert back.

Why tuples?
Converting the lists to tuples is necessary to avoid the "unhashable" error when they are passed to the fromkeys() function.
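For example, passing the raw lists to fromkeys() fails:

```python
# Lists are mutable and therefore unhashable, so this raises TypeError.
try:
    dict.fromkeys([[0, 1], [1, 0]])
    hashable = True
except TypeError:
    hashable = False
print(hashable)  # False

# Tuples, being immutable, work fine as dictionary keys:
keys = list(dict.fromkeys([(0, 1), (1, 0)]))
print(keys)  # [(0, 1), (1, 0)]
```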

my_list = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
tuple_list = [ tuple(sorted(item)) for item in my_list ]
final_list = [ list(item) for item in list({}.fromkeys(tuple_list)) ]

Using an OrderedDict even preserves the list order.

from collections import OrderedDict

my_list = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
tuple_list = [ tuple(sorted(item)) for item in my_list ]
final_list = [ list(item) for item in list(OrderedDict.fromkeys(tuple_list)) ]

The above code results in the desired list:

[[0, 1], [0, 4], [1, 4]]
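On Python 3.7 and newer, plain dicts also preserve insertion order, so the same result can be had with dict.fromkeys and no import (a small variant of the code above):

```python
my_list = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
# Normalize each pair by sorting, then use dict keys for ordered deduplication.
tuple_list = [tuple(sorted(item)) for item in my_list]
final_list = [list(item) for item in dict.fromkeys(tuple_list)]
print(final_list)  # [[0, 1], [0, 4], [1, 4]]
```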
Zooplankton answered 15/12, 2016 at 19:2 Comment(0)

If the order of pairs and pair-items matters, creating a new list by testing for membership might be the way to go here.

pairs = [0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]
no_dups = []
for pair in pairs:
    if not any( all( i in p for i in pair ) for p in no_dups ):
        no_dups.append(pair)

Otherwise, I'd go with Styvane's answer.

Incidentally, the above solution will not work for cases in which you have matching pairs. For example, [0,0] would not be added to the list. For that, you'd need to add an additional check:

for pair in pairs:
    if not any( all( i in p for i in pair ) for p in no_dups ) or ( len(set(pair)) == 1 and not pair in no_dups ):
        no_dups.append(pair)

However, that solution will not pick up empty "pairs" (e.g., []). For that, you'll need one more adjustment:

    if not any( all( i in p for i in pair ) for p in no_dups ) or ( len(set(pair)) in (0,1) and not pair in no_dups ):
        no_dups.append(pair)

The and not pair in no_dups bit is required to prevent adding the [0,0] or [] to no_dups twice.
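Putting the adjustments together, a runnable sketch covering the [0, 0] and [] edge cases:

```python
pairs = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1], [0, 0], []]
no_dups = []
for pair in pairs:
    # Keep the pair unless a kept pair already contains all its items,
    # with a special case for identical-element and empty "pairs".
    if (not any(all(i in p for i in pair) for p in no_dups)
            or (len(set(pair)) in (0, 1) and pair not in no_dups)):
        no_dups.append(pair)
print(no_dups)  # [[0, 1], [0, 4], [1, 4], [0, 0], []]
```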

Boat answered 15/12, 2016 at 15:21 Comment(0)

Well, I am "checking for the reverse pair and append to a list if that's not the case" as you said you could do, but I'm using a single loop.

x=[[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
out = []
for pair in x:
    if pair[::-1] not in out:
        out.append(pair)
print(out)

The advantage over existing answers is being, IMO, more readable. No deep knowledge of the standard library is needed here. And no keeping track of anything complex. The only concept that might be unfamiliar for beginners is that [::-1] reverses the pair.

The performance is O(n**2) though, so do not use it if performance is an issue and/or the lists are big.

Fermium answered 15/12, 2016 at 21:44 Comment(0)
