Getting unique combinations from a non-unique list of items, FASTER?
Asked Answered
A

2

4

First off, I am able to do it but I am not happy with the speed.

My question is, Is there a better, faster way of doing this?

I have a list of items looking like this:

[(1,2), (1,2), (4,3), (7,8)]

And I need to get all the unique combinations. For example, the unique combinations of 2 items would be:

[(1,2), (1,2)], [(1,2), (4,3)], [(1,2), (7,8)], [(4,3), (7,8)]

After using itertools.combinations I get a lot more than that because of duplicates. For example, I get every list containing (1,2) twice. If I create a set of these combinations I get the unique ones. The problem comes when the original list has 80 tuples and I want combinations with 6 items in them. Getting that set takes more than 30 seconds. If I can get that number down I would be very happy.

I am aware that the number of combinations is huge and that's why creating the set is time-consuming. But I am still hoping that there is a library that has optimized the process in some way, speeding it up a bit.

It might be important to note that from all the combinations I find I test out only the first 10000 or so. Because in some cases all combos can be waay too much to process so I don't really want to spend too much time on them as there are other tests to be done too.

This is a sample of what I have now:

from itertools import combinations

ls = [list of random NON-unique sets (x,y)]
# ls = [(1,2), (1,2), (4,3), (7,8)]  # example
# in the second code snipped it is shown how I generate ls for testing

all_combos = combinations(ls, 6)
all_combos_set = set(all_combos)

for combo in all_combos_set:
  do_some_test_on(combo)

In case you want to test it out .. here is what I use for testing the speed of different methods:

def main3():
    tries = 4
    elements_in_combo = 6
    rng = 90
    data = [0]*rng
    for tr in range(tries):
        for n in range(1, rng):
            quantity = 0
            name = (0,0)
            ls = []
            for i in range(n):
                if quantity == 0:
                    quantity = int(abs(gauss(0, 4)))
                    if quantity != 0:
                        quantity -= 1
                    name = (randint(1000,7000), randint(1000,7000))
                    ls.append(name)
                else:
                    quantity -= 1
                    ls.append(name)

            start_time = time.time()
            all_combos = combinations(ls, elements_in_combo)
            all_combos = set(all_combos)

            duration = time.time() - start_time
            data[n] += duration
            print(n, "random files take", duration, "seconds.")

            if duration > 30:
                break

    for i in range(rng):
        print("average duration for", i, "is", (data[i]/tries), "seconds.")
Acklin answered 2/4, 2017 at 13:21 Comment(9)
#1208618Mciver
I have tried googling "combinations numpy" too.Acklin
For combinations of two items: {((1, 2), (4, 3)), ((4, 3), (7, 8)), ((1, 2), (1, 2)), ((7, 8), (7, 8)), ((1, 2), (7, 8))}. For combinations of 3 items: {((4, 3), (7, 8), (7, 8)), ((1, 2), (4, 3), (7, 8)), ((1, 2), (7, 8), (7, 8)), ((1, 2), (1, 2), (1, 2)), ((1, 2), (1, 2), (4, 3)), ((1, 2), (1, 2), (7, 8))}Acklin
For what I'm using this code I would love to go over all possible combinations and any number of items. If I have N numbers I want all combinations with 2, 3, 4... N items. I found out that my time limits hit around 6. That's why I'm testing with 6 in the code.Acklin
@Acklin see my updated answer and check the link at the very bottom of it for another generator function which is fast in case of many duplicates and slow in case of only few duplicates.Courtnay
@Acklin by the way: you can offer a bounty for any question you like to see a good answer to, so feel free to do it for the question #43283825 in order to get what you want (there is already a useful answer there, but still not the best one possible).Courtnay
@Claudio my input list is sorted so I have no issues as the ones you discussed. Sorry I missed to point that out earlier. Also, your answer was really all I needed. Things worked out nice.Acklin
@Acklin ok - got it . I consider the question now as "closed" and will delete all my comments here, because they are not of interest for future visitors of the question, so you can delete yours too as some of them doesn't make sense if mine are no more there. In a long run I will probably try to put the for combinations from a list with duplicates without lookups found algorithm into Python standard lib to achieve a real speedup on creating unique combinations, but this is another story. See you maybe on next question/answer here (y) :) .Courtnay
By the way: I have got on my question related to combinations currently already more visits as this question has - the bounty does a good job if you want to draw attention ... :D .Courtnay
C
5

The originally asked question "is there a better, faster way of doing this?" has actually two questions in it:

  • Is there a faster way?
  • Is there a better way?

I would like to narrow the answer to the question "Is there a faster way?" to:

Is there a FASTER way of removing duplicates from a list as doing it as follows:

lstWithUniqueElements = list(set(lstWithDuplicates))

?

To my knowledge, there is no faster way ...

Now let's concentrate more on the second part of the question ( "Is there a better way?" ). It is usually very hard and needs much discussion to answer such kind of question, but it will be not the case here, because what a better way is, was already clearly stated by the author of the question (citation):

I'd love to use a generator function. The itertools combinations() itself is an iterable and not a list or set, so if I figure out how to yield the unique combinations that'd be great.

So HERE it is :

def uniqueCombinations(lstList, comboSize): 
    from itertools import combinations
    lstList.sort()
    allCombos = combinations(lstList, comboSize)
    setUniqueCombos = set()
    for comboCandidate in allCombos:
        if comboCandidate in setUniqueCombos:
            continue
        yield comboCandidate
        setUniqueCombos.add(comboCandidate)

That's it ...


One more thing is maybe worth to mention here. The by the author of the question chosen method of getting unique combinations in case the list they are generated from has not only unique but also multiple elements with same value doesn't work in some special cases like this one:
set(combinations(['a','a','b','a'], 2)) gives: {('a', 'b'), ('b', 'a'), ('a', 'a')}
uniqueCombinations(['a','a','b','a'],2) gives: {('a', 'b'), ('a', 'a')}

In between there is a pure Python function available here on stackoverflow which is both faster and slower as this one provided above. How can it be faster AND slower? See HERE for details.

Courtnay answered 5/4, 2017 at 22:24 Comment(3)
I removed my comments so if anyone has real feedback he can use the comments. Thanks!Fowle
I don't know what kind of black magic that was but uniqueCombinations works really good. It looks exactly the same as: all_combos = combinations(ls, elements_in_combo) bs = set() enough = 0 for combo in all_combos: if combo in bs: continue bs.add(combo) if enough > 10000: break But one outperforms the other by a lot.Acklin
The black magic part was that I was stupid and I missed the increment of "enough". I must have been blind sighted by desperation :D Thanks you again for all the time spent on this.Acklin
A
1

I guess this answer comes well after the OP needs it, but I run into the same issue and I would like to contribute my solution. I wanted to store no combination in the memory, because it is easy to see how this can go wrong.

First, this link provides a very clear explanation about how to calculate the number of distinct combinations when elements are repeated. The strategy is to create combinations with replacement and then discard the combinations that are not allowed.

For example, if the collection is (A, A, B, B) and you want all combinations of 3 elements, the combinations (A, A, A) and (B, B, B) are not allowed. Therefore, the idea is to create all possible combinations with replacement from the list of unique elements in the original sets and then discard those combinations that are not valid. This does not occupy the memory with any lookup and is easy to write.

However, this strategy is wasteful when we have sets with many unique elements. Taking this issue to its extreme, the only 3-elements-long combination from the set (A, B, C) is clearly (A, B, C), but this strategy would produce (A, A, A), (A, A, B), ... To alleviate this issue, one may notice that a unique element can only appear once in a valid combination: with unique elements, the standard itertools.combinations() will do.

Therefore, if we have a mixture of unique and repeated element, the final combinations can be split in a part that is generated from unique elements with itertools.combinations() and a part that is generated from itertools.combinations_with_replacement() for repeated elements.

All in all, this is the code. How fast it runs really depends on the amount of repetition in the original collection. The worst case scenario is the one without repetition:

import itertools
from collections import Counter

#Check if an element is repeated more times than allowed.
def comb_check(comb, original_dic):
    trouble = False
    if not comb:
        return(not trouble)
    comb_unique = set(comb)
    ratio = len(comb_unique)/len(comb)
    if ratio < 1:
       comb = Counter(comb)
       ks = (v for v in comb_unique)
       complete = False
       while (not trouble) and (not complete):
           try:
               k = next(ks)
               if comb[k] > 1:
                   if original_dic[k] < comb[k]: trouble = True
           except StopIteration:
               complete = True
    return(not trouble)

def generate_comb(elements,k):
    elements = Counter(elements)
    elements_unique = [k for k,v in elements.items() if v == 1]
    elements_other = [k for k, v in elements.items() if k not in elements_unique]
    max_repetition = sum([elements[k] for k in elements_other ])
    for n in range(0, min(k+1,len(elements_unique)+1)):
        if (n + max_repetition)>= k:
            for i in itertools.combinations(elements_unique, n):
                for j in itertools.combinations_with_replacement(elements_other, k-n):
                    if comb_check(j, elements):
                        (yield  j)

#All unique elements is the worst case when it comes to time
lst = [a for a in range(80)]
for k in generate_comb(lst, 6):
    pass
#It took my machine ~ 264 sec to run this

#Slightly better
lst = [a for a in range(40)] + [a for a in range(40)]
for k in generate_comb(lst, 6):
    pass
#It took my machine ~ 32 sec to run this
Aboveground answered 9/7, 2019 at 22:1 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.