Maintaining the order of the elements in a frozen set

Asked 29/8, 2017 at 10:10 Answered 29/8, 2017 at 10:55

Solved python python-2.7 pandas frozenset

I have a list of tuples, each tuple of which contains one string and two integers. The list looks like this:

x = [('a',1,2), ('b',3,4), ('x',5,6), ('a',2,1)]

The list contains thousands of such tuples. Now if I want to get unique combinations, I can do the frozenset on my list as follows:

y = set(map(frozenset, x))

This gives me the following result:

{frozenset({'a', 2, 1}), frozenset({'x', 5, 6}), frozenset({3, 'b', 4})}

I know that set is an unordered data structure and this is normal case but I want to preserve the order of the elements here so that I can thereafter insert the elements in a pandas dataframe. The dataframe will look like this:

 Name  Marks1  Marks2
0    a       1       2
1    b       3       4
2    x       5       6

Coffeecolored answered 29/8, 2017 at 10:10 Comment(4)

Do you really have situations where (a, 1, 2) is presented as (2, a, 1) and they should be considered duplicates? As it would seem that just using pd.Series(x).unique() is what you're after... – Merriemerrielle 29/8, 2017 at 10:15

Otherwise... is (a, 1, 2) definitely the same as (a, 2, 1) - if so - which one gets preserved (a, 1, 2) or (a, 2, 1)? – Merriemerrielle 29/8, 2017 at 10:16

(a,1,2) and (a,2,1) are treated as same and only the first occurence should be preserved i.e. (a,1,2) – Coffeecolored 29/8, 2017 at 10:17

Youch - so depending on the input of your data your output is going to be different... – Merriemerrielle 29/8, 2017 at 10:21

Instead of operating on the set of frozensets directly you could use that only as a helper data-structure - like in the unique_everseen recipe in the itertools section (copied verbatim):

from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

Basically this would solve the issue when you use key=frozenset:

>>> x = [('a',1,2), ('b',3,4), ('x',5,6), ('a',2,1)]

>>> list(unique_everseen(x, key=frozenset))
[('a', 1, 2), ('b', 3, 4), ('x', 5, 6)]

This returns the elements as-is and it also maintains the relative order between the elements.

Hood answered 29/8, 2017 at 10:21 Comment(7)

Of course, if x is a pd.Series object you can take advantage of that and use x[x.apply(frozenset).drop_duplicates().index] to take advantage of all the pandas goodness... – Merriemerrielle 29/8, 2017 at 10:31

Even though that's much shorter - it also would be slower (depending on the length of x either much slower or just a bit slower). At least if I remember correctly. – Hood 29/8, 2017 at 10:34

I get using the itertools recipe 37.3ns per loop and 12us per loop using the apply/drop_duplicates/index approach. So yeah - only a slightly difference :) – Merriemerrielle 29/8, 2017 at 10:40

Yes, I also had a ~300x improvement when using the recipe instead of pandas indexing for the short lists. For lists containing several ten-thousand tuples it's only 1.5-2 times faster. – Hood 29/8, 2017 at 10:42

That sounds more ball park from what I recall... Anyway - just throwing it out there 'cos if you're doing more than a straight dedupe at the same time, (and you've already got a series) - then keeping it in the pandas eco-system can sometimes be worth the cost. – Merriemerrielle 29/8, 2017 at 10:44

Bah... not much luck with pd.DataFrame(((*row, frozenset(row)) for row in x), columns=['M1','M2','M3','K']).drop_duplicates(subset=['K']) either - although it's not that much slower and you end up with the dataframe wanted... think I'll just give up now :) – Merriemerrielle 29/8, 2017 at 10:51

Could be useful as another answer though - given that the question is tagged with pandas :) Like you said: speed is nice but sometimes it's nicer to just use existing functions instead of rolling own functions. – Hood 29/8, 2017 at 10:52

No ordering with frozensets. You can instead create sorted tuples to check for the existence of an item, adding the original if the tuple does not exist in the set:

y = set()
lst = []
for i in x:
    t = tuple(sorted(i, key=str)
    if t not in y:
         y.add(t)
         lst.append(i)
print(lst)
# [('a', 1, 2), ('b', 3, 4), ('x', 5, 6)]

The first entry gets preserved.

Left answered 29/8, 2017 at 10:13 Comment(0)

You can do it by simple using the zip to maintain the order in the frozenset. Give this a try pls.

l = ['col1','col2','col3','col4']
>>> frozenset(l)
--> frozenset({'col2', 'col4', 'col3', 'col1'})
>>> frozenset(zip(*zip(l)))
--> frozenset({('col1', 'col2', 'col3', 'col4')})

Taking an example from the question asked:

>>> x = [('a',1,2), ('b',3,4), ('x',5,6), ('a',2,1)]
>>> frozenset(zip(*zip(x)))
--> frozenset({(('a', 1, 2), ('b', 3, 4), ('x', 5, 6), ('a', 2, 1))})

Telegenic answered 29/8, 2017 at 10:10 Comment(0)

There are some quite useful functions in NumPy which can help you to solve this problem.

import numpy as np
chrs, indices = np.unique(list(map(lambda x:x[0], x)), return_index=True)
chrs, indices
>> (array(['a', 'b', 'x'], 
   dtype='<U1'), array([0, 1, 2]))
[x[indices[i]] for i in range(indices.size)]
>> [('a', 1, 2), ('b', 3, 4), ('x', 5, 6)]

Zlatoust answered 29/8, 2017 at 10:55 Comment(0)

Recommended topics

Hot tags