Picking unordered combinations from pools with overlap

I have pools of values and I would like to generate every possible unordered combination by picking from certain pools.

For example, I wanted to pick from pool 0, pool 0, and pool 1:

>>> pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
>>> part = (0, 0, 1)
>>> list(product(*(pools[i] for i in part)))
[(1, 1, 2), (1, 1, 3), (1, 1, 4), (1, 2, 2), (1, 2, 3), (1, 2, 4), (1, 3, 2), (1, 3, 3), (1, 3, 4), (2, 1, 2), (2, 1, 3), (2, 1, 4), (2, 2, 2), (2, 2, 3), (2, 2, 4), (2, 3, 2), (2, 3, 3), (2, 3, 4), (3, 1, 2), (3, 1, 3), (3, 1, 4), (3, 2, 2), (3, 2, 3), (3, 2, 4), (3, 3, 2), (3, 3, 3), (3, 3, 4)]

This generates every possible combination by picking from pool 0, pool 0, and pool 1.

However, order doesn't matter to me, so many of the combinations are actually duplicates. For example, since I used a Cartesian product, both (1, 2, 4) and (2, 1, 4) are generated.

I came up with a simple method to mitigate this issue. For members picked from a single pool, I select without ordering using combinations_with_replacement. I count how many times I want to draw from each pool. The code looks like this:

from collections import Counter
from itertools import chain, combinations_with_replacement, product

cnt = Counter(part)  # how many times to draw from each pool index
blocks = [combinations_with_replacement(pools[i], cnt[i]) for i in cnt]
return [list(chain(*combo)) for combo in product(*blocks)]

This reduces ordering duplicates if I happen to choose from the same pool multiple times. However, all the pools have lots of overlap, and using combinations_with_replacement on several pools merged together would generate some invalid combinations. Is there a more efficient method to generate unordered combinations?
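
For example, with the pools above and part = (0, 0, 1), this yields 18 tuples instead of the 27 from the full product, but cross-pool duplicates such as [1, 2, 3] remain. A quick check:

from itertools import chain, combinations_with_replacement, product

res = [list(chain(*combo))
       for combo in product(combinations_with_replacement(pools[0], 2),
                            combinations_with_replacement(pools[1], 1))]
print(len(res))                              # 18, down from the 27 above
print(len({tuple(sorted(r)) for r in res}))  # 15 truly distinct combinations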

Edit: Extra info about the inputs: the number of parts and pools is small (~5 and ~20), and for simplicity, each element is an integer. The actual problem I have already solved, so this is just for academic interest. Let's say there are hundreds or thousands of integers in each pool, but some pools are small and have only dozens. So some kind of union or intersection seems to be the way to go.

Unchaste answered 14/8, 2018 at 5:47 Comment(7)
I think you can combine your pools into a single pool?Pashalik
I suspect counting the number of possibilities may be #P-complete.Vomitory
How many combinations do you expect? In your example you collect them in a list; could you in practice also collect them in a set after sorting them? Or do you have to generate them?Mother
@eddiewould, I'm not sure about that. I thought the same thing.... I tried generating all combinations of the combined pool choose the number of pools using libraries capable of generating combinations of multisets and found out that this method still produces many duplicates and worse you end up getting results that don't exist in the Cartesian product.Hopper
@qwr, how long does your current solution take on the test case I gave below?Hopper
Hey... even in the best possible case (all pools identical), with the input sizes you're talking about, you're never going to get through all these combinations no matter how you generate them. Whatever problem you're trying to solve by generating these combinations, you need to find a better way to solve it.Vomitory
@Vomitory well some pools are very small, particularly those with low index. The context is that this is an idea for a Project Euler problem which I've solved with a different method so now it is for curiosity.Unchaste

This isn't "an answer" so much as a spur to think harder ;-) For concreteness, I'll wrap the OP's code, slightly respelled, in a function that also weeds out duplicates:

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from itertools import chain, product
    from collections import Counter

    assert all(0 <= i < len(pools) for i in ixs)
    seen = set()
    cnt = Counter(ixs) # map index to count
    blocks = [cwr(pools[i], count) for i, count in cnt.items()]
    for t in product(*blocks):
        t = tuple(sorted(chain(*t)))
        if t not in seen:
            seen.add(t)
            yield t

I don't fear sorting here - it's memory-efficient, and for small tuples is likely faster than all the overheads involved in creating a Counter object.

But regardless of that, the point here is to emphasize the real value the OP got by reformulating the problem to use combinations_with_replacement (cwr). Consider these inputs:

N = 64
pools = [[0, 1]]
ixs = [0] * N

There are only 65 unique results, and the function generates them instantaneously, with no internal duplicates at all. On the other hand, the essentially identical

pools = [[0, 1]] * N
ixs = range(N)

also has the same 65 unique results, but essentially runs forever (as would, e.g., the other answers so far given), slogging through 2**64 possibilities. cwr doesn't help here because each pool index appears only once.

So there's astronomical room for improvement over any solution that "merely" weeds out duplicates from a full Cartesian product, and some of that can be won by doing what the OP already did.

It seems to me the most promising approach would be to write a custom generator (not one relying primarily on itertools functions) that generated all possibilities in lexicographic order to begin with (so, by construction, no duplicates would be created to begin with). But that requires some "global" analysis of the input pools first, and the code I started on quickly got more complex than I can make time to wrestle with now.

One based on @user2357112's answer

Combining cwr with @user2357112's incremental de-duplicating gives a brief algorithm that runs fast on all the test cases I have. For example, it's essentially instantaneous for both spellings of the [0, 1] ** 64 examples above, and runs the example at the end of @Joseph Wood's answer approximately as fast as he said his C++ code ran (0.35 seconds on my box under Python 3.7.0, and, yes, found 162295 results):

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from collections import Counter

    assert all(0 <= i < len(pools) for i in ixs)
    result = {()}
    for i, count in Counter(ixs).items():
        result = {tuple(sorted(old + new))
                  for new in cwr(pools[i], count)
                  for old in result}
    return result

To make it easier for other Pythonistas to try the last example, here's the input as executable Python:

pools = [[1, 10, 14, 6],
         [7, 2, 4, 8, 3, 11, 12],
         [11, 3, 13, 4, 15, 8, 6, 5],
         [10, 1, 3, 2, 9, 5,  7],
         [1, 5, 10, 3, 8, 14],
         [15, 3, 7, 10, 4, 5, 8, 6],
         [14, 9, 11, 15],
         [7, 6, 13, 14, 10, 11, 9, 4]]
ixs = range(len(pools))
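
And a quick way to confirm the count quoted above:

print(len(gen(pools, ixs)))   # 162295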

However, the OP later added that they typically have about 20 pools, each with some thousands of elements. 1000**20 = 1e60 is waaaaay out of practical reach for any approach that builds the full Cartesian product, no matter how cleverly it weeds out duplicates. It remains clear as mud how many they expect to be duplicates, though, so also clear as mud whether this kind of "incremental de-duplicating" is good enough to be practical.

Ideally we'd have a generator that yielded one result at a time, in lexicographic order.

Lazy lexicographic one-at-a-time generation

Building on the incremental de-duplication, suppose we have a strictly increasing (lexicographic) sequence of sorted tuples, append the same tuple T to each, and sort each again. Then the derived sequence is still in strictly increasing order. For example, in the left column we have the 10 unique pairs from range(4), and in the right column what happens after we append (and sort again) 2 to each:

00 002
01 012
02 022
03 023
11 112
12 122
13 123
22 222
23 223
33 233

They started in sorted order, and the derived triples are also in sorted order. I'll skip the easy proof (sketch: if t1 and t2 are adjacent tuples, t1 < t2, and let i be the smallest index such that t1[i] != t2[i]. Then t1[i] < t2[i] (what "lexicographic <" means). Then if you throw x into both tuples, proceed by cases: is x <= t1[i]? between t1[i] and t2[i]? is x >= t2[i]? In each case it's easy to see that the first derived tuple remains strictly less than the second derived tuple.)
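
A quick empirical check of that claim (separate from gen()):

from itertools import combinations_with_replacement as cwr

pairs = list(cwr(range(4), 2))                        # the left column above
derived = [tuple(sorted(p + (2,))) for p in pairs]    # the right column above
assert all(a < b for a, b in zip(derived, derived[1:]))
print(derived[:3])   # [(0, 0, 2), (0, 1, 2), (0, 2, 2)]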

So supposing we have a sorted sequence result of all unique sorted tuples from some number of pools, what happens when we add elements of a new pool P into the tuples? Well, as above,

[tuple(sorted(old + (P[0],))) for old in result]

is also sorted, and so is

[tuple(sorted(old + (P[i],))) for old in result]

for all i in range(len(P)). These guaranteed already-sorted sequences can be merged via heapq.merge(), and another generator (killdups() below) run on the merge result to weed out duplicates on the fly. There's no need to, e.g., keep a set of all tuples seen so far. Because the output of the merge is non-decreasing, it's sufficient just to check whether the next result is the same as the last result output.
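
In isolation, that merge-then-dedup step looks like this (repeating killdups() from the code below so the snippet stands alone):

from heapq import merge

def killdups(xs):
    last = None
    for x in xs:
        if x != last:
            yield x
            last = x

print(list(killdups(merge([1, 2, 2, 4], [2, 3, 4, 5]))))   # [1, 2, 3, 4, 5]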

Getting this to work lazily is delicate. The entire result-so-far sequence has to be accessed by each element of the new pool being added, but we don't want to materialize the whole thing in one gulp. Instead itertools.tee() allows each element of the next pool to traverse the result-so-far sequence at its own pace, and automatically frees memory for each result item after all new pool elements have finished with it.

The function build1() (or some workalike) is needed to ensure that the right values are accessed at the right times. For example, if the body of build1() is pasted in inline where it's called, the code will fail spectacularly (the body would access the final values bound to rcopy and new instead of what they were bound to at the time the generator expression was created).
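
A minimal illustration of that late-binding behavior, unrelated to gen() itself: free variables in a generator expression are looked up when the generator is consumed, not when it's created, so a helper function is needed to freeze their values.

broken = [(n * x for x in range(3)) for n in (1, 10)]
print([list(g) for g in broken])   # [[0, 10, 20], [0, 10, 20]] - both see n == 10

def build(n):                      # passing n as an argument freezes its value
    return (n * x for x in range(3))

fixed = [build(n) for n in (1, 10)]
print([list(g) for g in fixed])    # [[0, 1, 2], [0, 10, 20]]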

In all, of course this is somewhat slower, due to layers of delayed generator calls and heap merges. In return, it returns the results in lexicographic order, can start delivering results very quickly, and has lower peak memory burden if for no other reason than that the final result sequence isn't materialized at all (little is done until the caller iterates over the returned generator).

Tech note: don't fear sorted() here. The appending is done via old + new for a reason: old is already sorted, and new is typically a 1-tuple. Python's sort is linear-time in this case, not O(N log N).

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from itertools import tee
    from collections import Counter
    from heapq import merge

    def killdups(xs):
        last = None
        for x in xs:
            if x != last:
                yield x
                last = x

    def build1(rcopy, new):
        return (tuple(sorted(old + new)) for old in rcopy)

    assert all(0 <= i < len(pools) for i in ixs)
    result = [()]
    for i, count in Counter(ixs).items():
        poolelts = list(cwr(pools[i], count))
        xs = [build1(rcopy, new)
              for rcopy, new in zip(tee(result, len(poolelts)),
                                    poolelts)]
        result = killdups(merge(*xs))
    return result

2 inputs

Turns out that for the 2-input case there's an easy approach derived from set algebra. If x and y are the same, cwr(x, 2) is the answer. If x and y are disjoint, product(x, y). Else the intersection c of x and y is non-empty, and the answer is the catenation of 4 cross-products obtained from the 3 pairwise-disjoint sets c, x-c, and y-c: cwr(c, 2), product(x-c, c), product(y-c, c), and product(x-c, y-c). Proof is straightforward but tedious so I'll skip it. For example, there are no duplicates between cwr(c, 2) and product(x-c, c) because every tuple in the latter contains an element from x-c, but every tuple in the former contains elements only from c, and x-c and c are disjoint by construction. There are no duplicates within product(x-c, y-c) because the two inputs are disjoint (if they contained an element in common, that would have been in the intersection of x and y, contradicting that x-c has no element in the intersection). Etc.

Alas, I haven't found a way to generalize this beyond 2 inputs, which surprised me. It can be used on its own, or as a building block in other approaches. For example, if there are many inputs, they can be searched for pairs with large intersections, and this 2-input scheme used to do those parts of the overall products directly.

Even at just 3 inputs, it's not clear to me how to get the right result for

[1, 2], [2, 3], [1, 3]

The full Cartesian product has 2**3 = 8 elements, only one of which repeats: (1, 2, 3) appears twice (as (1, 2, 3) and again as (2, 3, 1)). Each pair of inputs has a 1-element intersection, but the intersection of all 3 is empty.
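
A quick check of that count:

from itertools import product

trips = [tuple(sorted(t)) for t in product([1, 2], [2, 3], [1, 3])]
print(len(trips), len(set(trips)))   # 8 7 -- only (1, 2, 3) repeats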

Here's an implementation:

def pair(x, y):
    from itertools import product, chain
    from itertools import combinations_with_replacement
    x = set(x)
    y = set(y)
    c = x & y
    chunks = []
    if c:
        x -= c
        y -= c
        chunks.append(combinations_with_replacement(c, 2))
        if x:
            chunks.append(product(x, c))
        if y:
            chunks.append(product(y, c))
    if x and y:
        chunks.append(product(x, y))
    return chain.from_iterable(chunks)
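
And a quick check against the question's first two pools:

combos = sorted(tuple(sorted(t)) for t in pair([1, 2, 3], [2, 3, 4]))
print(combos)
# [(1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4)]
print(len(combos))   # 8, matching the deduplicated Cartesian product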

A Proof-of-Concept Based on Maximal Matching

This blends ideas from @Leon's sketch and an approach @JosephWood sketched in comments. It's not polished, and can obviously be sped up, but it's reasonably quick on all the cases I tried. Because it's rather complex, it's probably more useful to post it in an already-hard-enough-to-follow un-optimized form anyway!

This doesn't make any attempt to determine the set of "free" pools (as in @Leon's sketch). Primarily because I didn't have code sitting around for that, and partly because it wasn't immediately clear how to accomplish that efficiently. I did have code sitting around to find a match in a bipartite graph, which required only a few changes to use in this context.

So this tries plausible result prefixes in lexicographic order, as in @JosephWood's sketch, and for each sees whether it's actually possible to construct via checking whether a bipartite-graph match exists.

So while the details of @Leon's sketch are largely unimplemented here, the visible behaviors are much the same: it produces results in lexicographic order, it doesn't need to check for duplicates, it's a lazy generator, peak memory use is proportional to the sum of the lengths of the pools, it can obviously be parallelized in many ways (set different processes to work on different regions of the result space), and the key to making it majorly faster lies in reducing the massive amounts of redundant work done by the graph-matching function (a great deal of what it does on each call merely reproduces what it did on the previous call).

def matchgen(pools, ixs):
    from collections import Counter
    from collections import defaultdict
    from itertools import chain, repeat, islice

    elt2pools = defaultdict(set)
    npools = 0
    for i, count in Counter(ixs).items():
        set_indices = set(range(npools, npools + count))
        for elt in pools[i]:
            elt2pools[elt] |= set_indices
        npools += count
    elt2count = {elt : len(ps) for elt, ps in elt2pools.items()}

    cands = sorted(elt2pools.keys())
    ncands = len(cands)

    result = [None] * npools

    # Is it possible to match result[:n] + [elt]*count?
    # We already know it's possible to match result[:n], but
    # this code doesn't exploit that.
    def match(n, elt, count):

        def extend(x, seen):
            for y in elt2pools[x]:
                if y not in seen:
                    seen.add(y)
                    if y in y2x:
                        if extend(y2x[y], seen):
                            y2x[y] = x
                            return True
                    else:
                        y2x[y] = x
                        return True
            return False

        y2x = {}
        freexs = []
        # A greedy pass first to grab easy matches.
        for x in chain(islice(result, n), repeat(elt, count)):
            for y in elt2pools[x]:
                if y not in y2x:
                    y2x[y] = x
                    break
            else:
                freexs.append(x)
        # Now do real work.
        seen = set()
        for x in freexs:
            seen.clear()
            if not extend(x, seen):
                return False
        return True

    def inner(i, j): # fill result[j:] with elts from cands[i:]
        if j >= npools:
            yield tuple(result)
            return
        for i in range(i, ncands):
            elt = cands[i]
            # Find the most times `elt` can be added.
            count = min(elt2count[elt], npools - j)
            while count:
                if match(j, elt, count):
                    break
                count -= 1
            # Since it can be added `count` times, it can also
            # be added any number of times less than `count`.
            for k in range(count):
                result[j + k] = elt
            while count:
                yield from inner(i + 1, j + count)
                count -= 1

    return inner(0, 0)

EDIT: note that there's a potential trap here, illustrated by the pair of pools range(10_000) and range(100_000). After producing (9999, 99999), the first position increments to 10000, and then it continues for a very long time deducing that there's no match for any of the possibilities in 10001 .. 99999 in the second position; and then for 10001 in the first position no match for any of the possibilities in 10002 .. 99999 in the second position; and so on. @Leon's scheme instead would have noted that range(10_000) was the only free pool remaining having picked 10000 in the first position, and noted at once then that range(10_000) doesn't contain any values greater than 10000. It would apparently need to do that again for 10001, 10002, ..., 99999 in the first position. That's a linear-time rather than quadratic-time waste of cycles, but a waste all the same. Moral of the story: don't trust anything until you have actual code to try ;-)

And One Based on @Leon's Scheme

Following is a more-than-less faithful implementation of @Leon's ideas. I like the code better than my "proof of concept" (POC) code just above, but was surprised to find that the new code runs significantly slower (a factor of 3 to 4 times slower on a variety of cases akin to @JosephWood's randomized example) relative to a comparably "optimized" variant of the POC code.

The primary reason appears to be more calls to the matching function. The POC code called that once per "plausible" prefix. The new code doesn't generate any impossible prefixes, but for each prefix it generates may need to make multiple match() calls to determine the possibly smaller set of free pools remaining. Perhaps there's a cleverer way to do that.

Note that I added one twist: if a free pool's elements are all smaller than the last element of the prefix so far, it remains "a free pool" with respect to the prefix, but it's useless because none of its elements can appear in the candidates. This doesn't matter to the outcome, but it means the pool remains in the set of free pools for all remaining recursive calls, which in turn means they can waste time determining that it's still a "free pool". So when a free pool can no longer be used for anything, this version removes it from the set of free pools. This gave a significant speedup.

Note: there are many ways to try matching, some of which have better theoretical O() worst-case behavior. In my experience, simple depth-first (as here) search runs faster in real life on typical cases. But it depends very much on characteristics of what "typical" graphs look like in the application at hand. I haven't tried other ways here.

Bottom lines, ignoring the "2 inputs" special-case code:

  • Nothing here beats incremental de-duplication for speed, if you have the RAM. But nothing is worse than that for peak memory burden.

  • Nothing beats the matching-based approaches for frugal memory burden. They're in an entirely different universe on that measure. They're also the slowest, although at least in the same universe ;-)

The code:

def matchgen(pools, ixs):
    from collections import Counter
    from collections import defaultdict
    from itertools import islice

    elt2pools = defaultdict(list)
    allpools = []
    npools = 0
    for i, count in Counter(ixs).items():
        indices = list(range(npools, npools + count))
        plist = sorted(pools[i])
        for elt in plist:
            elt2pools[elt].extend(indices)
        for i in range(count):
            allpools.append(plist)
        npools += count
    pools = allpools
    assert npools == len(pools)

    result = [None] * npools

    # Is it possible to match result[:n] not using pool
    # bady?  If not, return None.  Else return a matching,
    # a dict whose keys are pool indices and whose values
    # are a permutation of result[:n].
    def match(n, bady):

        def extend(x, seen):
            for y in elt2pools[x]:
                if y not in seen:
                    seen.add(y)
                    if y not in y2x or extend(y2x[y], seen):
                        y2x[y] = x
                        return True
            return False

        y2x = {}
        freexs = []
        # A greedy pass first to grab easy matches.
        for x in islice(result, n):
            for y in elt2pools[x]:
                if y not in y2x and y != bady:
                    y2x[y] = x
                    break
            else:
                freexs.append(x)

        # Now do real work.
        for x in freexs:
            if not extend(x, {bady}):
                return None
        return y2x

    def inner(j, freepools): # fill result[j:]
        from bisect import bisect_left
        if j >= npools:
            yield tuple(result)
            return
        if j:
            new_freepools = set()
            allcands = set()
            exhausted = set()  # free pools with elts too small
            atleast = result[j-1]
            for pi in freepools:
                if pi not in new_freepools:
                    m = match(j, pi)
                    if not m:  # match must use pi
                        continue
                    # Since `m` is a match to result[:j],
                    # any pool in freepools it does _not_
                    # use must still be free.
                    new_freepools |= freepools - m.keys()
                    assert pi in new_freepools
                # pi is free with respect to result[:j].
                pool = pools[pi]
                if pool[-1] < atleast:
                    exhausted.add(pi)
                else:
                    i = bisect_left(pool, atleast)
                    allcands.update(pool[i:])
            if exhausted:
                freepools -= exhausted
                new_freepools -= exhausted
        else: # j == 0
            new_freepools = freepools
            allcands = elt2pools.keys()

        for result[j] in sorted(allcands):
            yield from inner(j + 1, new_freepools)

    return inner(0, set(range(npools)))

Note: this has its own classes of "bad cases". For example, passing 128 copies of [0, 1] consumes about 2 minutes(!) of time on my box to find the 129 results. The POC code takes under a second, while some of the non-matching approaches appear instantaneous.

I won't go into detail about why. Suffice it to say that because all the pools are the same, they all remain "free pools" no matter how deep the recursion goes. match() never has a hard time, always finding a complete match for the prefix in its first (greedy) pass, but even that takes time proportional to the square of the current prefix length (== the current recursion depth).

Pragmatic hybrid

One more here. As noted before, the matching-based approaches suffer some from the expense of graph matching as a fundamental operation repeated so often, and have some unfortunate bad cases pretty easy to stumble into.

Highly similar pools cause the set of free pools to shrink slowly (or not at all). But in that case the pools are so similar that it rarely matters which pool an element is taken from. So the approach below doesn't try to keep exact track of free pools, picks arbitrary pools so long as such are obviously available, and resorts to graph-matching only when it gets stuck. That seems to work well. As an extreme example, the 129 results from 128 [0, 1] pools are delivered in less than a tenth of a second instead of in two minutes. It turns out it never needs to do graph-matching in that case.

Another problem with the POC code (and less so for the other match-based approach) was the possibility of spinning wheels for a long time after the last result was delivered. A pragmatic hack solves that one completely ;-) The last tuple of the sequence is easily computed in advance, and code raises an internal exception to end everything immediately after the last tuple is delivered.

That's it for me! A generalization of the "two inputs" case would remain very interesting to me, but all the itches I got from the other approaches have been scratched now.

def combogen(pools, ixs):
    from collections import Counter
    from collections import defaultdict
    from itertools import islice

    elt2pools = defaultdict(set)
    npools = 0
    cands = []
    MAXTUPLE = []
    for i, count in Counter(ixs).items():
        indices = set(range(npools, npools + count))
        huge = None
        for elt in pools[i]:
            elt2pools[elt] |= indices
            for i in range(count):
                cands.append(elt)
            if huge is None or elt > huge:
                huge = elt
        MAXTUPLE.extend([huge] * count)
        npools += count
    MAXTUPLE = tuple(sorted(MAXTUPLE))
    cands.sort()
    ncands = len(cands)
    ALLPOOLS = set(range(npools))
    availpools = ALLPOOLS.copy()
    result = [None] * npools

    class Finished(Exception):
        pass

    # Is it possible to match result[:n]? If not, return None.  Else
    # return a matching, a dict whose keys are pool indices and
    # whose values are a permutation of result[:n].
    def match(n):

        def extend(x, seen):
            for y in elt2pools[x]:
                if y not in seen:
                    seen.add(y)
                    if y not in y2x or extend(y2x[y], seen):
                        y2x[y] = x
                        return True
            return False

        y2x = {}
        freexs = []
        # A greedy pass first to grab easy matches.
        for x in islice(result, n):
            for y in elt2pools[x]:
                if y not in y2x:
                    y2x[y] = x
                    break
            else:
                freexs.append(x)

        # Now do real work.
        seen = set()
        for x in freexs:
            seen.clear()
            if not extend(x, seen):
                return None
        return y2x

    def inner(i, j):  # fill result[j:] with cands[i:]
        nonlocal availpools
        if j >= npools:
            r = tuple(result)
            yield r
            if r == MAXTUPLE:
                raise Finished
            return
        restore_availpools = None
        last = None
        jp1 = j + 1
        for i in range(i, ncands):
            elt = cands[i]
            if elt == last:
                continue
            last = result[j] = elt
            pools = elt2pools[elt] & availpools
            if pools:
                pool = pools.pop() # pick one - arbitrary
                availpools.remove(pool)
            else:
                # Find _a_ matching, and if that's possible fiddle
                # availpools to pretend that's the one we used all
                # along.
                m = match(jp1)
                if not m: # the prefix can't be extended with elt
                    continue
                if restore_availpools is None:
                    restore_availpools = availpools.copy()
                availpools = ALLPOOLS - m.keys()
                # Find a pool from which elt was taken.
                for pool, v in m.items():
                    if v == elt:
                        break
                else:
                    assert False
            yield from inner(i+1, jp1)
            availpools.add(pool)

        if restore_availpools is not None:
            availpools = restore_availpools

    try:
        yield from inner(0, 0)
    except Finished:
        pass
Saxen answered 17/8, 2018 at 0:44 Comment(15)
Thank you for posting this. I too started developing an algorithm that would generate lexicographical "Cartesian" combinations. My code started out okay for a few cases and it was very efficient, but as the cases got more general, the complexity increase seemed to tip the scales in favor of the hash table above. Also, I don't know if I am convinced that combinations_with_replacement or multiset_combinations is a better alternative... continuingHopper
As I stated in my answer, these will inevitably generate impossible "Cartesian" combinations, and the added effort to check that each result is indeed a valid combination, in addition to checking for duplicates, seems to make this option more of a headache.Hopper
@JosephWood, that's where the complexity comes in: not in checking whether a combination is valid, but in generating only valid combinations to begin with.Saxen
I agree with the custom generator approach using some preprocessing seems most likely (a more complicated CWR type idea)Unchaste
Not sure why this answer doesn't have more upvotes. My original upvote was for confirming that I was not alone in my thinking. I have been working on this problem for more than a couple of days now, and many of the ideas you mentioned originally were in line with what I had experienced to that point. The combination deduplication example would have gotten my second upvote. And finally, your last bit has been my aim for the past 24 hours (i.e. not generating lexicographical combinations per se, but building sorted lexicographical combos) would have gotten my third, fourth, etc.Hopper
Here is what I have so far on generating sorted lexicographical tuples. We first take all pools and add a grouping attribute (i.e. the first pool is group 1, the second pool (doesn't matter if it is the same as the first) is group 2, and so on). Then we combine them and sort them based on value (not by group). Now, we proceed from left to right using the grouping attribute to eliminate available values. For example: p1 = (3 5 2), p2 = (6 2 2), p3 = (4 3 2). We combine and sort giving : (2 2 2 2 3 3 4 5 6) with the corresponding grouping vector of (1 2 2 3 1 3 3 1 2). Now...Hopper
We can start building. We pick the first value of 2 with group 1 (g = 1), eliminate all values with g = 1, giving (2 2 2 3 4 6) and groups (2 2 3 3 3 2). Now we pick the first value of 2 with g = 2, eliminate all values with g = 2, giving (2 3 4) and groups (3 3 3). Now we build three combinations of (2 2 2), (2 2 3), (2 2 4). Now we go up one group, increment until our second value changes (we skip the 2's), so we get a value of 3 with g = 3. This will give a combo of (2 3 6). And so on. Now, you will note that this won't give strict order as (2 3 3) is possible, but...Hopper
That combination (i.e. (2 3 3)) will be encountered with this method. My code got really busy, and I'm not sure if it ever creates duplicates, but it seems to have merit. Anywho, thanks again for your great ideas.Hopper
Last comment, I just realized that if we first made sure that each pool had unique values only, then we would have stronger guarantees that duplicate combinations will be avoided with the method I outlined. So, we would first modify p2 = (6 2 2) to be only p2 = (6 2) and then continue with the grouping and whatnot.Hopper
@JosephWood, yup, we're thinking very much alike! You can assume that each pool is a set (no duplicates), because the OP added this comment to another answer yesterday: "We can assume the values in the pool are unique (a one-time operation) ". Agreed that "God's solution" wouldn't use combinations_with_replacement (cwr) at all, but rather treat a group used N times as N distinct groups - the best approach would surely deduce all on its own what cwr knows.Saxen
@TimPeters, I really appreciate your effort here. This problem has captivated my brain since it was posted and the knowledge bombs that you continuously drop have helped tremendously. I'm not going to give up on this problem and I'll be sure to post something if I find something new that is helpful.Hopper
@JosephWood, and I appreciate the appreciation ;-) It's an irony on StackOverflow that the hardest problems I work hardest on get the fewest upvotes ;-) I had tried the scheme you sketched but never got it to wholly work. The pair [2, 3] and [2, 4] illustrates a key problem: to get (2, 3) requires picking 2 from the second pair, but to get (2, 4) requires picking 2 from the 1st pair. So any scheme that traverses the possibilities strictly "left to right" can't work. I have yet to find an efficient way around that does always work. I've been waiting for you :-)Saxen
@TimPeters, "So any scheme that traverses the possibilities strictly 'left to right' can't work" - I have posted a preliminary answer that generates the combinations strictly left-to-right. Tomorrow I will provide details about an efficient implementation of that idea.Giffin
@Leon, my telegraphic comment was addressed to a specific class of algorithm JospehWood sketched. I think you're taking it out of context, because what I said pretty obviously applies to what it was intended to apply to ;-)Saxen
@TimPeters, incredible work!!! I worked on this problem for far too long and when I came back to the post (I lied since I promised I would not return until I had completed a working implementation (see my comment on Leon's post)), to my delight, you had put together a beautiful array of ideas. Even if we still don't have the perfect solution, your work here has laid a strong foundation for such a solution.Hopper

This is a difficult problem. I think your best bet in the general case is to implement a hash table where the key is a multiset and the value is your actual combination. This is similar to what @ErikWolf mentioned; however, this method avoids producing duplicates in the first place, so filtering is not required. It also returns the correct result when we encounter multisets.

There is a faster solution that I am teasing now, but saving for later. Bear with me.

As mentioned in the comments, one approach that seems viable is to combine all of the pools and simply generate combinations of this combined pool choose the number of pools. You would need a tool capable of generating combinations of multisets; the one I know of is available in Python in the sympy library: from sympy.utilities.iterables import multiset_combinations. The problem with this is that we still produce duplicate values and, worse, we produce results that are impossible to obtain with an analogous set and product combo. For example, if we were to do something like sort and combine all of the pools from the OP and apply the following:

list(multiset_combinations([1,2,2,3,3,3,4,4,5], 3))

A couple of the results would be [1 2 2] and [4 4 5] which are both impossible to obtain from [[1, 2, 3], [2, 3, 4], [3, 4, 5]].

Outside of special cases, I don't see how it is possible to avoid checking every possible product. I hope I am wrong.

Algorithm Overview

The main idea is to map combinations of our product of vectors to unique combinations without having to filter out duplicates. The example given by the OP (i.e. (1, 2, 3) and (1, 3, 2)) should only map to one value (either one of them, as order doesn't matter). We note that the two vectors are identical sets. Now, we also have situations like:

vec1 = (1, 2, 1)
vec2 = (2, 1, 1)
vec3 = (2, 2, 1)

We need vec1 and vec2 to map to the same value whereas vec3 needs to map to its own value. This is the problem with sets as all of these are equivalent sets (with sets, the elements are unique thus {a, b, b} and {a, b} are equivalent).

This is where multisets come into play. With multisets, (2, 2, 1) and (1, 2, 1) are distinct, however (1, 2, 1) and (2, 1, 1) are the same. This is good. We now have a method to generate unique keys.
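
In Python terms, for example, a sorted tuple (or a Counter) captures exactly this multiset equality:

print(tuple(sorted((1, 2, 1))) == tuple(sorted((2, 1, 1))))   # True
print(tuple(sorted((1, 2, 1))) == tuple(sorted((2, 2, 1))))   # False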

As I am not a Python programmer, I will proceed in C++.

We will have some issues if we try to implement everything above exactly as is. As far as I know, you can't have std::multiset<int> as the key portion for a std::unordered_map. However, we can for a regular std::map. It isn't as performant as a hash table underneath (it is actually a red-black tree), but it still gives decent performance. Here it is:

void cartestionCombos(std::vector<std::vector<int> > v, bool verbose) {

    std::map<std::multiset<int>, std::vector<int> > cartCombs;

    unsigned long int len = v.size();
    unsigned long int myProd = 1;
    std::vector<unsigned long int> s(len);

    for (std::size_t j = 0; j < len; ++j) {
        myProd *= v[j].size();
        s[j] = v[j].size() - 1;
    }

    unsigned long int loopLim = myProd - 1;
    std::vector<std::vector<int> > res(myProd, std::vector<int>());
    std::vector<unsigned long int> myCounter(len, 0);
    std::vector<int> value(len, 0);
    std::multiset<int> key;

    for (std::size_t j = 0; j < loopLim; ++j) {
        key.clear();

        for (std::size_t k = 0; k < len; ++k) {
            value[k] = v[k][myCounter[k]];
            key.insert(value[k]);
        }

        cartCombs.insert({key, value});

        int test = 0;
        while (myCounter[test] == s[test]) {
            myCounter[test] = 0;
            ++test;
        }

        ++myCounter[test];
    }

    key.clear();
    // Get last possible combination
    for (std::size_t k = 0; k < len; ++k) {
        value[k] = v[k][myCounter[k]];
        key.insert(value[k]);
    }

    cartCombs.insert({key, value});

    if (verbose) {
        int count = 1;

        for (std::pair<std::multiset<int>, std::vector<int> > element : cartCombs) {
            std::string tempStr;

            for (std::size_t k = 0; k < len; ++k)
                tempStr += std::to_string(element.second[k]) + ' ';

            std::cout << count << " : " << tempStr << std::endl;
            ++count;
        }
    }
}

With test cases of 8 vectors of lengths from 4 to 8 filled with random integers from 1 to 15, the above algorithm runs in about 5 seconds on my computer. That's not bad considering we are looking at nearly 2.5 million total results from our product, but we can do better. But how?

The best performance is given by std::unordered_map with a key that is built in constant time. Our key above is built in logarithmic time (multiset, map and hash map complexity). So the question is, how can we overcome these hurdles?

Best Performance

We know we must abandon std::multiset. We need some sort of object that has a commutative type property while also giving unique results.

Enter the Fundamental Theorem of Arithmetic

It states that every integer greater than 1 can be uniquely represented (up to the order of the factors) as a product of prime numbers. This is sometimes called the prime decomposition.

So now, we can simply proceed as before, but instead of constructing a multiset, we map each value to a prime number (using the value as an index into a list of primes) and multiply the results together. This gives us constant-time construction of our key. Here is an example showing the power of this technique on the vectors we created earlier (N.B. P below is a list of prime numbers: 2, 3, 5, 7, 11, etc.):

                   Maps to                    Maps to            product
vec1 = (1, 2, 1)    -->>    P[1], P[2], P[1]   --->>   3, 5, 3    -->>    45
vec2 = (2, 1, 1)    -->>    P[2], P[1], P[1]   --->>   5, 3, 3    -->>    45
vec3 = (2, 2, 1)    -->>    P[2], P[2], P[1]   --->>   5, 5, 3    -->>    75

This is awesome!! vec1 and vec2 map to the same number, whereas vec3 gets mapped to a different value just as we wished.

void cartestionCombosPrimes(std::vector<std::vector<int> > v, 
                        std::vector<int> primes,
                        bool verbose) {

    std::unordered_map<int64_t, std::vector<int> > cartCombs;

    unsigned long int len = v.size();
    unsigned long int myProd = 1;
    std::vector<unsigned long int> s(len);

    for (std::size_t j = 0; j < len; ++j) {
        myProd *= v[j].size();
        s[j] = v[j].size() - 1;
    }

    unsigned long int loopLim = myProd - 1;
    std::vector<std::vector<int> > res(myProd, std::vector<int>());
    std::vector<unsigned long int> myCounter(len, 0);
    std::vector<int> value(len, 0);
    int64_t key;

    for (std::size_t j = 0; j < loopLim; ++j) {
        key = 1;

        for (std::size_t k = 0; k < len; ++k) {
            value[k] = v[k][myCounter[k]];
            key *= primes[value[k]];
        }

        cartCombs.insert({key, value});

        int test = 0;
        while (myCounter[test] == s[test]) {
            myCounter[test] = 0;
            ++test;
        }

        ++myCounter[test];
    }

    key = 1;
    // Get last possible combination
    for (std::size_t k = 0; k < len; ++k) {
        value[k] = v[k][myCounter[k]];
        key *= primes[value[k]];
    }

    cartCombs.insert({key, value});
    std::cout << cartCombs.size() << std::endl;

    if (verbose) {
        int count = 1;

        for (std::pair<int, std::vector<int> > element : cartCombs) {
            std::string tempStr;

            for (std::size_t k = 0; k < len; ++k)
                tempStr += std::to_string(element.second[k]) + ' ';

            std::cout << count << " : " << tempStr << std::endl;
            ++count;
        }
    }
}

On the same example above that would generate nearly 2.5 million products, the above algorithm returns the same result in less than 0.3 seconds.

There are a couple of caveats with this latter method. We must have our primes generated a priori and if we have many vectors in our Cartesian product, the key could grow beyond the bounds of int64_t. The first issue shouldn't be that difficult to overcome as there are many resources (libraries, lookup tables, etc.) available for generating prime numbers. I'm not really sure, but I've read that the latter issue shouldn't be a problem for python as integers have arbitrary precision (Python integer ranges).

We also have to deal with the fact that our source vectors might not be nice integer vectors with small values. This can be remedied by ranking all of the elements across all vectors before you proceed. For example, given the following vectors:

vec1 = (12345.65, 5, 5432.11111)
vec2 = (2222.22, 0.000005, 5)
vec3 = (5, 0.5, 0.8)

Ranking them, we would obtain:

rank1 = (6, 3, 5)
rank2 = (4, 0, 3)
rank3 = (3, 1, 2)

And now, these can be used in place of the actual values to create your key. The only portion of the code that would change would be the for loops that build the key (and of course the rank object that would need to be created):

for (std::size_t k = 0; k < len; ++k) {
    value[k] = v[k][myCounter[k]];
    key *= primes[rank[k][myCounter[k]]];
}
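
For the curious, here is a rough Python sketch of the same prime-key-plus-ranking idea (my own translation, not the code above; the helper name is made up and it assumes sympy is available for the primes). Python's arbitrary-precision integers sidestep the int64_t overflow concern:

from itertools import product
from sympy import prime

def cartesian_combos_primes(pools):
    # rank every distinct value so it can index into a list of primes
    values = sorted({v for pool in pools for v in pool})
    rank = {v: i for i, v in enumerate(values)}
    primes = [prime(i + 1) for i in range(len(values))]   # 2, 3, 5, 7, ...
    seen = {}
    for combo in product(*pools):
        key = 1
        for v in combo:                # commutative product -> order-insensitive key
            key *= primes[rank[v]]
        seen.setdefault(key, combo)    # keep the first combination seen for each key
    return list(seen.values())

print(len(cartesian_combos_primes([[1, 2, 3], [2, 3, 4], [3, 4, 5]])))   # 15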

Edit:
As some of the commenters have pointed out, the above method disguises the fact that all products must be generated. I should have said that the first time around. Personally, I don't see how it can be avoided given the many different presentations.

Also, in case anybody is curious, here is the test case I used above:

[1 10 14  6],
[7  2  4  8  3 11 12],
[11  3 13  4 15  8  6  5],
[10  1  3  2  9  5  7],
[1  5 10  3  8 14],
[15  3  7 10  4  5  8  6],
[14  9 11 15],
[7  6 13 14 10 11  9  4]

It should return 162295 unique combinations.

Hopper answered 16/8, 2018 at 21:51 Comment(10)
I think you made a mistake in your ranking: you say the second element in vec1 and the last element of vec3 are ranked one, when I believe they should be ranked 3Brammer
It looks to me like you're generating duplicate combinations and filtering eagerly, rather than avoiding duplicates entirely.Vomitory
@user2357112, are you alluding to the fact that I have to generate all products, but I'm not really keeping all of them?Hopper
@user2357112, I guess a better way of saying it is maybe "I'm filtering on the fly"?Hopper
I agree with @Vomitory. Your method is really smart, and essentially doing what the OP asked, but basically what you are doing could be done by creating every product and for each one saying if product not in allCombs: allCombs.append(product) Still a +1 for me though.Brammer
the prime number construction is clever, but it seems the same as generating all combos, sorting them (in constant time since the number of parts is fixed) and removing duplicates in a hash tableUnchaste
@qwr, you are correct, however there is a major saving in memory complexity. Also, I just edited my answer. I don't think it is possible to avoid generating all products. I hope I am wrong because I am very interested in this type of problem myself.Hopper
@qwr, how is sorting constant time? I don't really follow what you mean by "the number of parts is fixed".Hopper
Ok what I said was a little inaccurate. Just that for my application I use each combination in a way such that order doesn't matter. But I present here a more general problemUnchaste
@qwr, I still think you will have issues as I stated in my answer, you can't use set and obtain reliable results. See the example with [1 2 1] [2 1 1] & [2 2 1]. If you make these into hashable objects they will all return the same element and that is not what you want. So I'm pretty sure your current plan will have a lot of unexpected overhead to get your unique combinations which will cost.Hopper

One way to save some work might be to generate deduplicated combinations of the first k chosen pools, then extend those to deduplicated combinations of the first k+1. This lets you avoid individually generating and rejecting all length-20 combinations that picked 2, 1 instead of 1, 2 from the first two pools:

def combinations_from_pools(pools):
    # 1-element set whose one element is an empty tuple.
    # With no built-in hashable multiset type, sorted tuples are probably the most efficient
    # multiset representation.
    combos = {()}
    for pool in pools:
        combos = {tuple(sorted(combo + (elem,))) for combo in combos for elem in pool}
    return combos

With the input sizes you're talking about, though, no matter how efficiently you generate combinations, you'll never be able to process all of them. Even with 20 identical 1000-element pools, there would be 496432432489450355564471512635900731810050 combinations (1019 choose 20, by the stars and bars formula), or about 5e41. If you conquered the Earth and devoted all the processing power of all the computing equipment of all of humanity to the task, you still couldn't make a dent in that. You need to find a better way to solve your underlying task.
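
For instance, a quick check of the approach on the OP's example, and of the stars-and-bars count (math.comb needs Python 3.8+):

from math import comb

pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(len(combinations_from_pools([pools[0], pools[0], pools[1]])))   # 15
print(comb(1000 + 20 - 1, 20))   # the count quoted above for 20 identical 1000-element pools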

Vomitory answered 17/8, 2018 at 0:14 Comment(1)
This is a good heuristic. And I should clarify not all pools have thousands of elements.Unchaste

Answers that have been posted so far (including lazy lexicographic one-at-a-time generation by Tim Peters) have worst-case space complexity proportional to the size of the output. I am going to outline an approach that will constructively produce all unique unordered combinations without deduplication of internally generated intermediate data. My algorithm generates the combinations in a lexicographically sorted order. It has a computational overhead compared to the simpler algorithms. However it can be parallelized (so that different ranges of the final output can be produced concurrently).

The idea is the following.

So we have N pools {P1, ..., PN} where we must draw our combinations from. We can easily identify the smallest combination (with respect to the mentioned lexicographical ordering). Let it be (x1, x2 ..., xN-1, xN) (where x1 <= x2 <= ... <= xN-1 <= xN, and each xj is just the smallest element from one of the pools {Pi}). This smallest combination will be followed by zero or more combinations where the prefix x1, x2 ..., xN-1 is the same and the last position runs over an increasing sequence of values. How can we identify that sequence?

Let's introduce the following definition:

Given a combination prefix C=(x1, x2 ..., xK-1, xK) (where K < N), a pool Pi is called free with respect to C if the latter (the prefix) can be drawn from the rest of the pools.

Identifying free pools for a given prefix is easily reduced to the problem of finding maximum matchings in a bipartite graph. The challenging part is doing it efficiently (taking advantage of the specifics of our case). But I will save it for later (this is work in progress, to be materialized as a Python program in a day).

So, for the prefix (x1, x2 ..., xN-1) of our first combination we can identify all the free pools {FPi}. Any one of them can be used to pick an element for the last position. Therefore the sequence of interest is the sorted set of elements from {FP1 U FP2 U ... } that are greater than or equal to xN-1.

When the last position is exhausted we must increase the last-but-one position, whereupon we will repeat the procedure of finding the possible values for the last position. Unsurprisingly, the procedure of enumerating the values for the last-but-one (as well as any other) position is the same - the only difference is the length of the combination prefix based on which free pools must be identified.

Thus the following recursive algorithm should do the work:

  1. Start with an empty combination prefix C. At this point all pools are free.
  2. If the length of C is equal to N, then output C and return.
  3. Merge the free pools into one sorted list S and drop from it all elements that are less than the last element of C.
  4. For each value x from S do
    • The new combination prefix is C'=(C, x)
    • With the current combination prefix having grown by one, some of the free pools stop being free. Identify them and recurse into step 2 with an updated free pool list and combination prefix C'.
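
Not the promised efficient implementation, but a naive brute-force sketch of the outline above (my own reading of it; the function names are just for illustration, and the free-pool test is done by exhaustive search rather than bipartite matching, so it is only practical for tiny inputs):

def leon_gen(pools):
    # `pools` is the already-expanded list, one entry per draw (so the question's
    # part = (0, 0, 1) becomes [pools[0], pools[0], pools[1]]).
    n = len(pools)

    def drawable(prefix, pool_ids):
        # Can the values in `prefix` be drawn using each pool in pool_ids at most once?
        if not prefix:
            return True
        x, rest = prefix[0], prefix[1:]
        return any(x in pools[p] and drawable(rest, pool_ids - {p})
                   for p in pool_ids)

    def inner(prefix, free):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        floor = prefix[-1] if prefix else None
        # Step 3: merge the free pools and drop values below the last prefix element.
        cands = sorted({x for p in free for x in pools[p]
                        if floor is None or x >= floor})
        # Step 4: extend the prefix, recompute the free pools, and recurse.
        for x in cands:
            longer = prefix + [x]
            new_free = {p for p in range(n)
                        if drawable(longer, set(range(n)) - {p})}
            yield from inner(longer, new_free)

    return inner([], set(range(n)))

print(list(leon_gen([[1, 2], [2, 3], [1, 3]])))
# [(1, 1, 2), (1, 1, 3), (1, 2, 2), (1, 2, 3), (1, 3, 3), (2, 2, 3), (2, 3, 3)]
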
Giffin answered 23/8, 2018 at 0:26 Comment(4)
I don't yet follow this, but just a note to point out that my "lazy" scheme has worst-case memory use proportional to the penultimate (not final) sequence; if there are N pools, to the product of the sizes of the first N-1 pools. For example, passing [range(1_000_000), range(10_000_000)] works fine and iterating over it runs smoothly on my box with flat memory use despite that I don't have nearly enough RAM to hold even a significant fraction of 10 trillion tuples. tee() wasn't really designed to fork 10 million streams, though ;-) A scheme with tiny use would be great!Saxen
OK. I think I grasp this now - cool! It seems clear it can work, and probably with minimal memory burden. It's unclear to me how fast it can be made. Just FYI, I found it easier to grasp after replacing the concept of "free" with respect to C by "fixed" with respect to C, where pool P is fixed wrt C if P is part of every maximal match. That is, P must be used in constructing C (for example, P is the only pool containing some element of C). The set of fixed pools is just the complement of the set of free pools.Saxen
This looks really interesting. I’m going to try and implement this myself without looking back here until I’m finished. I’m eager to see how you handle this in parallel manner.Hopper
@JosephWood, adding parallelism would be trivial. For example, if U is the union of all pools, give each of len(U) processes a singleton prefix containing one of U's elements, and change the code to return when it would otherwise have looked for an alternative in the first tuple position. There are approximately a bazillion ways to spread the work out ;-) Finding an optimal way is probably hard, though, since we don't have an efficient way to calculate in advance how many result tuples are in a given range of tuples.Saxen

You could implement a hashable list and use Python's set() to filter out all duplicates. Your hash function just needs to ignore the order of the list, which can be achieved by using collections.Counter.

from collections import Counter

class HashableList(list):
    def __hash__(self):
        # frozenset of (value, count) pairs ignores order but keeps multiplicity
        return hash(frozenset(Counter(self).items()))
    def __eq__(self, other):
        return Counter(self) == Counter(other)

x = HashableList([1, 2, 3])
y = HashableList([3, 2, 1])

print(set([x, y]))

This returns:

{[1, 2, 3]}
Ayr answered 14/8, 2018 at 7:46 Comment(2)
I don't see the point of the subclass. Just do set(map(lambda x: frozenset(Counter(x)), the_sequence))... subclassing a built-in is practically never the right answer. They do not really behave as most people expect, and here you aren't really achieving anything except changing map(lambda x: frozenset(Counter(x)), ...) to map(HashableList, ...).Lesser
I know how to filter duplicates. My question is about not generating them in the first place.Unchaste

This is what I came up with:

class Combination:
    def __init__(self, combination):
        self.combination = tuple(sorted(combination))

    def __eq__(self, other):
        return self.combination == other.combination

    def __hash__(self):
        return self.combination.__hash__()

    def __repr__(self):
        return self.combination.__repr__()

    def __getitem__(self, i):
        return self.combination[i]

Then,

pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
part = (0, 0, 1)
set(Combination(combin) for combin in product(*(pools[i] for i in part)))

Outputs:

{(1, 1, 2),
 (1, 1, 3),
 (1, 1, 4),
 (1, 2, 2),
 (1, 2, 3),
 (1, 2, 4),
 (1, 3, 3),
 (1, 3, 4),
 (2, 2, 2),
 (2, 2, 3),
 (2, 2, 4),
 (2, 3, 3),
 (2, 3, 4),
 (3, 3, 3),
 (3, 3, 4)}

Not sure if this is really what you're looking for.

Affect answered 16/8, 2018 at 23:47 Comment(7)
This is just generating all the combinations and then filtering them out, which the other 2 answers do.Unchaste
@Unchaste I'm sorry, but are you suggesting that there should be some method of determining whether to generate a particular combination before generating it? If so, I disagree entirely. As to this method generating all the combinations, then filtering, I also disagree entirely. This method will generate the combinations one-by-one, and add them to the resulting set comprehension only when they aren't already in the set.Affect
Yes. In my OP I give an example of how this could be done for values from the same pool (combinations_with_replacement AFAIK never generates extraneous combinations)Unchaste
@Unchaste It certainly will if you have repeat values. From the documentation: "Elements are treated as unique based on their position, not on their value. So if the input elements are unique, the generated combinations will also be unique." See: docs.python.org/3/library/…Affect
Also from the documentation: "The code for combinations_with_replacement() can be also expressed as a subsequence of product() after filtering entries where the elements are not in sorted order (according to their position in the input pool)".Affect
We can assume the values in the pool are unique (a one-time operation)Unchaste
I can't prove it, but I suspect the following is true: assume you index each pool in your list (1, 2, ..., n). Now assume there exist m numbers in m groups where m < n such that you can swap each number at most one time. When you do, you swap the indices of the 2 groups you're swapping between. If you are able to create a cycle with that permutation of indices (i.e., reading left to right, starting at the smallest number, then circling around), you are guaranteed to have non-unique combinations.Affect
