FAST unique combinations (from list with duplicates) WITHOUT LOOKUPS
Asked Answered
A

2

9

In spite of the fact that there are online plenty of algorithms and functions for generating unique combinations of any size from a list of unique items, there is none available in case of a list of non-unique items (i.e. list containing repetitions of same value.)

The question is how to generate ON-THE-FLY in a generator function all the unique combinations from a non-unique list without the computational expensive need of filtering out duplicates?

I consider combination comboA to be unique if there is no other combination comboB for which sorted lists for both combinations are the same. Let's give an example of code checking for such uniqueness:

comboA = [1,2,2]
comboB = [2,1,2]
print("B is a duplicate of A" if sorted(comboA)==sorted(comboB) else "A is unique compared to B")

In the above given example B is a duplicate of A and the print() prints B is a duplicate of A.

The problem of getting a generator function capable of providing unique combinations on-the-fly in case of a non-unique list is solved here: Getting unique combinations from a non-unique list of items, FASTER?, but the provided generator function needs lookups and requires memory what causes problems in case of a huge amount of combinations.

The in the current version of the answer provided function does the job without any lookups and appears to be the right answer here, BUT ...

The goal behind getting rid of lookups is to speed up the generation of unique combinations in case of a list with duplicates.

I have initially (writing the first version of this question) wrongly assumed that code which doesn't require creation of a set used for lookups needed to assure uniqueness is expected to give an advantage over code needing lookups. It is not the case. At least not always. The code in up to now provided answer does not using lookups, but is taking much more time to generate all the combinations in case of no redundant list or if only a few redundant items are in the list.

Here some timings to illustrate the current situation:

-----------------
 k: 6 len(ls): 48
Combos   Used Code                               Time
---------------------------------------------------------
12271512 len(list(combinations(ls,k)))       :  2.036 seconds
12271512 len(list(subbags(ls,k)))            : 50.540 seconds
12271512 len(list(uniqueCombinations(ls,k))) :  8.174 seconds
12271512 len(set(combinations(sorted(ls),k))):  7.233 seconds
---------------------------------------------------------
12271512 len(list(combinations(ls,k)))       :  2.030 seconds
       1 len(list(subbags(ls,k)))            :  0.001 seconds
       1 len(list(uniqueCombinations(ls,k))) :  3.619 seconds
       1 len(set(combinations(sorted(ls),k))):  2.592 seconds

Above timings illustrate the two extremes: no duplicates and only duplicates. All other timings are between this two.

My interpretation of the results above is that a pure Python function (not using any C-compiled modules) can be extremely faster, but it can be also much slower depending on how many duplicates are in a list. So there is probably no way around writing C/C++ code for a Python .so extension module providing the required functionality.

Arnaldo answered 7/4, 2017 at 16:47 Comment(4)
How do you determine which of (1,2,2) and (2,1,2) are "correct"?Tetrabranchiate
Your first comment was what I was looking for.Tetrabranchiate
@Claudio I also found this thread which contains code for a much simpler iterative algorithm (requires sorting input) and also a recursive algorithm. They seem quite a bit more efficient than the current answers, but I haven't really tested them.Ingenuous
@lazydog See here cython.org and here #43729552 and feel free to provide another answer if you like with the ready to use module faster then the best of current answers. The recursion algo from your answer gives already a very good C-compiled Python module which is only a tiny bit slower than the Cython-optimized code of an iterator class version using the algo with loops instead of recursion. Sorry - haven't yet found the time to update the question to the state-of-the-art on this subject.Arnaldo
I
4

Instead of post-processing/filtering your output, you can pre-process your input list. This way, you can avoid generating duplicates in the first place. Pre-processing involves either sorting (or using a collections.Counter on) the input. One possible recursive realization is:

def subbags(bag, k):
    a = sorted(bag)
    n = len(a)
    sub = []

    def index_of_next_unique_item(i):
        j = i + 1

        while j < n and a[j] == a[i]:
            j += 1

        return j

    def combinate(i):
        if len(sub) == k:
            yield tuple(sub)
        elif n - i >= k - len(sub):
            sub.append(a[i])
            yield from combinate(i + 1)
            sub.pop()
            yield from combinate(index_of_next_unique_item(i))

    yield from combinate(0)

bag = [1, 2, 3, 1, 2, 1]
k = 3
i = -1

print(sorted(bag), k)
print('---')

for i, subbag in enumerate(subbags(bag, k)):
    print(subbag)

print('---')
print(i + 1)

Output:

[1, 1, 1, 2, 2, 3] 3
---
(1, 1, 1)
(1, 1, 2)
(1, 1, 3)
(1, 2, 2)
(1, 2, 3)
(2, 2, 3)
---
6

Requires some stack space for the recursion, but this + sorting the input should use substantially less time + memory than generating and discarding repeats.

Ingenuous answered 20/4, 2017 at 12:26 Comment(3)
Unfortunately, i'm not familiar with the Python/C api, and i don't have a lot free time until Sunday night. I'll try and look into it later, and also try to develop an iterative algorithm as well, unless someone else wants to beat me to it.Ingenuous
@Claudio so it looks like i don't have the proper environment to build C-extension modules, and setting it up (on Windows) seems quite involved. i can't write a C (or C++) extension if i can't run and test it incrementally. Sorry. i also wrote an iterative Python algorithm that's similar to the Python equivalent of itertools.combinations_with_replacement given in the documentation, but it is quite ugly and not significantly faster than the recursive code, at best.Ingenuous
@Claudio: In the comment you deleted you asked lazy dog to write a C/C++ module for you. That is not acceptable, please don't do it again.Baptiste
A
1

The current state-of-the-art inspired initially by a 50 than by a 100 reps bounties is at the moment (instead of a Python extension module written entirely in C):

An efficient algorithm and implementation that is better than the obvious (set + combinations) approach in the best (and average) case, and is competitive with it in the worst case.

It seems to be possible to fulfill this requirement using a kind of "fake it before you make it" approach. The current state-of-the-art is that there are two generator function algorithms available for solving the problem of getting unique combinations in case of a non-unique list. The below provided algorithm combines both of them what becomes possible because it seems to exist a threshold value for percentage of unique items in the list which can be used for appropriate switching between the two algorithms. The calculation of the percentage of uniqueness is done with so tiny amount of computation time that it even doesn't clearly show up in the final results due to common variation of the taken timing.

def iterFastUniqueCombos(lstList, comboSize, percUniqueThresh=60):

    lstListSorted = sorted(lstList)
    lenListSorted = len(lstListSorted)

    percUnique = 100.0 - 100.0*(lenListSorted-len(set(lstListSorted)))/lenListSorted

    lstComboCandidate = []
    setUniqueCombos = set()

    def idxNextUnique(idxItemOfList):
        idxNextUniqueCandidate = idxItemOfList + 1
        while (
                idxNextUniqueCandidate < lenListSorted 
                    and 
                lstListSorted[idxNextUniqueCandidate] == lstListSorted[idxItemOfList]
        ): # while
            idxNextUniqueCandidate += 1
        idxNextUnique = idxNextUniqueCandidate
        return idxNextUnique

    def combinate(idxItemOfList):
        if len(lstComboCandidate) == sizeOfCombo:
            yield tuple(lstComboCandidate)
        elif lenListSorted - idxItemOfList >= sizeOfCombo - len(lstComboCandidate):
            lstComboCandidate.append(lstListSorted[idxItemOfList])
            yield from combinate(idxItemOfList + 1)
            lstComboCandidate.pop()
            yield from combinate(idxNextUnique(idxItemOfList))

    if percUnique > percUniqueThresh:
        from itertools import combinations
        allCombos = combinations(lstListSorted, comboSize)
        for comboCandidate in allCombos:
            if comboCandidate in setUniqueCombos:
                continue
            yield comboCandidate
            setUniqueCombos.add(comboCandidate)
    else:
        yield from combinate(0)
    #:if/else    
#:def iterFastUniqueCombos()

The below provided timings show that the above iterFastUniqueCombos() generator function provides a clear advantage over uniqueCombinations() variant in case the list has less than 60 percent of unique elements and is not worse as the on (set + combinations) based uniqueCombinations() generator function in the opposite case where it gets much faster than the iterUniqueCombos() one (due to switching between the (set + combinations) and the (no lookups) variant at 60% threshold for amount of unique elements in the list):

===========  sizeOfCombo: 6   sizeOfList: 48   noOfUniqueInList 1   percUnique   2
Combos: 12271512  print(len(list(combinations(lst,k))))           :   2.04968 seconds.
Combos:        1  print(len(list(      iterUniqueCombos(lst,k)))) :   0.00011 seconds.
Combos:        1  print(len(list(  iterFastUniqueCombos(lst,k)))) :   0.00008 seconds.
Combos:        1  print(len(list(    uniqueCombinations(lst,k)))) :   3.61812 seconds.

==========  sizeOfCombo: 6   sizeOfList: 48   noOfUniqueInList 48   percUnique 100
Combos: 12271512  print(len(list(combinations(lst,k))))           :   1.99383 seconds.
Combos: 12271512  print(len(list(      iterUniqueCombos(lst,k)))) :  49.72461 seconds.
Combos: 12271512  print(len(list(  iterFastUniqueCombos(lst,k)))) :   8.07997 seconds.
Combos: 12271512  print(len(list(    uniqueCombinations(lst,k)))) :   8.11974 seconds.

==========  sizeOfCombo: 6   sizeOfList: 48   noOfUniqueInList 27   percUnique  56
Combos: 12271512  print(len(list(combinations(lst,k))))           :   2.02774 seconds.
Combos:   534704  print(len(list(      iterUniqueCombos(lst,k)))) :   1.60052 seconds.
Combos:   534704  print(len(list(  iterFastUniqueCombos(lst,k)))) :   1.62002 seconds.
Combos:   534704  print(len(list(    uniqueCombinations(lst,k)))) :   3.41156 seconds.

==========  sizeOfCombo: 6   sizeOfList: 48   noOfUniqueInList 31   percUnique  64
Combos: 12271512  print(len(list(combinations(lst,k))))           :   2.03539 seconds.
Combos:  1114062  print(len(list(      iterUniqueCombos(lst,k)))) :   3.49330 seconds.
Combos:  1114062  print(len(list(  iterFastUniqueCombos(lst,k)))) :   3.64474 seconds.
Combos:  1114062  print(len(list(    uniqueCombinations(lst,k)))) :   3.61857 seconds.
Arnaldo answered 26/4, 2017 at 10:52 Comment(1)
i'm not sure how to rewrite the recursive one to an equivalent iterative version, but here is an alternative iterative algorithm that might be easier to adapt as a C-extension. The source code for itertools.combinations_with_replacement might be helpful. Still, I think the combined, meta-algorithm approach in your answer is simpler and effective. It's similar to the standard quick sort+insertion sort combo.Ingenuous

© 2022 - 2024 — McMap. All rights reserved.