Enumerating combinations in a distributed manner
Asked Answered
S

3

16

I have a problem where I must analyse 500C5 combinations (255244687600) of something. Distributing it over a 10-node cluster where each cluster processes roughly 10^6 combinations per second means the job will be complete in about seven hours.

The problem I have is distributing the 255244687600 combinations over the 10 nodes. I'd like to present each node with 25524468760, however the algorithms I'm using can only produce the combinations sequentially, I'd like to be able to pass the set of elements and a range of combination indicies, for example, [0-10^7), [10^7,2.0 10^7), etc. and have the nodes themselves figure out the combinations.

The algorithms I'm using at the moment are from the following:

I've considered using a master node, that enumerates each of the combinations and sends work to each of the nodes. However, the overhead incurred in iterating the combinations from a single node and communicating back and forth work is enormous, and it will subsequently lead to the master node becoming the bottleneck.

Is there any good combination iterating algorithms geared up for efficient/optimal distributed enumeration?

Sundsvall answered 15/1, 2011 at 8:15 Comment(6)
Not much experience in this area, but it sounds like a problem that google MapReduce could be applied to.Gussie
MapReduce is irrelevant here, as the question is about the "Map" part of the term: How does one efficiently map a n-choose-k space problem into m parts without the need for a central distributor.Sundsvall
@Reyzooti: Hence the "not much experience". Happy to be corrected, though.Gussie
Permutations can be systematically numbered using the factorial number system. In your case only one out of each 495!*5! permutation is a relevant combination. So I gather, you can probably compute the start permutation = combination for each node, then just go on from there. This idea may pan out or not. Depending on the details; it's just an idea. ;-) Cheers & hth.,Trilbie
@Alf: Can you please provide a more in pdeth explanation please.Sundsvall
@Reyzooti: I think larsman's answer below, about using combinatorial numbers, is better idea, more direct. I didn't even know direct combinatorial numbers existed. But I once wrote up the generation of permutations directly from factorial numbers as a letter to the editor of I think it was "Software Development"; they titled it "The third method's the charm" or something like that. Turned out that algorithm had already been invented by someone at TI or HP (I don't recall, but I think it was TI). Cheers,Trilbie
D
3

You may have some success with combinatorial numbers, which allow you to retrieve the N'th (n/10th) k-combination with a simple algorithm; then run the next_combination algorithm n/10 times on each of the ten nodes to iterate.

Sample code (in C#, but quite readable for a C++ programmer) can be found on MSDN.

Dominquedominquez answered 15/1, 2011 at 8:50 Comment(3)
The James McCaffrey article, where he describes a method to get the Nth combination is too expensive. Using next_combination (links) mutates the original range, perhaps something that can determine what the range looks like at the Nth combination, because one could pass that specific range to a compute node.Sundsvall
Why is it too expensive? You only need to run this 10 times on the master, then run next_combination on the compute nodes.Dominquedominquez
@Reyzooti: if you have an index-based thing, then turning it into a RandomAccessIterator is easy --> keep a pointer to the sequence and an index in the iterator :)Maurreen
D
0

Have node number n process every tenth combination, starting from the nth.

Disqualify answered 15/1, 2011 at 9:9 Comment(1)
Still requires each node to iterate over every n-choose-k combos, which results in 90% iteration redudancy per node, less overhead than the master node solution however still more than distributing ranges of combinations.Sundsvall
P
0

I know this question is old, but here is how it may be done efficiently.

All code currently in Python which I'm sure will be easy enough to translate to C++.
- You will probably want to move from using an integer for the characteristic vector to using a bit array, since the integers used will need 500 bits (not a problem in Python)
- Feel free to update to C++ anyone.


  1. Distribute to the nodes their range of combinations (start number and length to process), the set of items from which combinations are to be chosen, and the number, k, of items to choose.
  2. Initialise each node by having it find its starting combination directly from start and items.
  3. Run each node by having it do the work for the first combination then iterate through the rest of its combinations, and do the associated work.

To perform 1 do as you suggest find n-choose-k and divide it into ranges - in your case 500-Choose-5 is, as you said, 255244687600 so, for node=0 to 9 you distribute:
(start=node*25524468760, length=25524468760, items=items, k=5)

To perform 2 you can find the starting combination directly (without iteration) using the combinatorial number system and find the integer representation of the combination's characteristic vector (to be used for the iteration in 3) at the same time:

def getCombination(index, items, k):
    '''Returns (combination, characteristicVector)
    combination - The single combination, of k elements of items, that would be at
    the provided index if all possible combinations had each been sorted in
    descending order (as defined by the order of items) and then placed in a
    sorted list.
    characteristicVector - an integer with chosen item's bits set.
    '''
    combination = []
    characteristicVector = 0
    n = len(items)
    nCk = 1
    for nMinusI, iPlus1 in zip(range(n, n - k, -1), range(1, k + 1)):
        nCk *= nMinusI
        nCk //= iPlus1
    curIndex = nCk
    for k in range(k, 0, -1):
        nCk *= k
        nCk //= n
        while curIndex - nCk > index:
            curIndex -= nCk
            nCk *= (n - k)
            nCk -= nCk % k
            n -= 1
            nCk //= n
        n -= 1
        combination .append(items[n])
        characteristicVector += 1 << n
    return combination, characteristicVector 

The integer representation of the characteristic vector has k bits set in the positions of the items that make up the combination.

To perform 3 you can use Gosper's hack to iterate to the next characteristic vector for the combination in the same number system (the next combination that would appear in a sorted list of reverse sorted combinations relative to items) and, at the same time, create the combination:

def nextCombination(items, characteristicVector):
    '''Returns the next (combination, characteristicVector).
    combination - The next combination of items that would appear after the
    combination defined by the provided characteristic vector if all possible
    combinations had each been sorted in descending order (as defined by the order
    of items) and then placed in a sorted list.
    characteristicVector - an integer with chosen item's bits set.
    '''
    u = characteristicVector & -characteristicVector
    v = u + characteristicVector
    if v <= 0:
        raise OverflowError("Ran out of integers") # <- ready for C++
    characteristicVector =  v + (((v ^ characteristicVector) // u) >> 2)
    combination = []
    copiedVector = characteristicVector
    index = len(items) - 1
    while copiedVector > 0:
        present, copiedVector = divmod(copiedVector, 1 << index)
        if present:
            combination.append(items[index])
        index -= 1
    return combination, characteristicVector

Repeat this length-1 times (since you already found the first one directly).


For example:

Five nodes processing 7-choose-3 letters:

>>> items = ('A','B','C','D','E','F','G')
>>> k = 3
>>> nodes = 5
>>> n =  len(items)
>>> for nmip1, i in zip(range(n - 1, n - k, -1), range(2, k + 1)):
...     n = n * nmip1 // i
...
>>> for node in range(nodes):
...     length = n // nodes
...     start = node * length
...     print("Node {0} initialised".format(node))
...     combination, cv = getCombination(start, items, k)
...     doWork(combination)
...     for i in range(length-1):
...             combination, cv = nextCombination(items, cv)
...             doWork(combination)
...
Node 0 initialised
Doing work with: C B A
Doing work with: D B A
Doing work with: D C A
Doing work with: D C B
Doing work with: E B A
Doing work with: E C A
Doing work with: E C B
Node 1 initialised
Doing work with: E D A
Doing work with: E D B
Doing work with: E D C
Doing work with: F B A
Doing work with: F C A
Doing work with: F C B
Doing work with: F D A
Node 2 initialised
Doing work with: F D B
Doing work with: F D C
Doing work with: F E A
Doing work with: F E B
Doing work with: F E C
Doing work with: F E D
Doing work with: G B A
Node 3 initialised
Doing work with: G C A
Doing work with: G C B
Doing work with: G D A
Doing work with: G D B
Doing work with: G D C
Doing work with: G E A
Doing work with: G E B
Node 4 initialised
Doing work with: G E C
Doing work with: G E D
Doing work with: G F A
Doing work with: G F B
Doing work with: G F C
Doing work with: G F D
Doing work with: G F E
>>>
Peeler answered 7/4, 2016 at 14:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.