How to enumerate combinations filtering repeats
Asked Answered
A

4

5

I have a list of possible choices: [[1], [2, 4], [4], [5, 6, 2], [5, 3]].

I want to list all combinations, taking maximum one element from each sublist, without repeating elements.

So [1, 2, 4, 5, 3] is a valid option. But [1, 4, 4, 5, 3] is not. I allow not making a choice in any sublist, so [1,4, None,5,3] is valid, as in [1, None, None, None, None], and [None, None, None, None, None].

I can't simply enumerate all combinations then filter out the ones I don't want, since for a large list of possible choices, it would quickly become computationally infeasible (I'm looking at 25^25 maximum combinations in my project).

edit: I would also apply some additional criteria to the results, such as filtering to have no more than a threshold of None choices, or sorting the resultant list of combinations in order of combinations with fewest None choices.

edit: with details of the real-life case: I'd like to apply it to a list of 25 sublists, each of which can have 1-25 elements. Realisitically, each sublist will have max 15 elements, with 2-4 on average.

So the easy solution of list(itertools.product(*choices)) then filtering is out.

I may also wish to add other filter conditions to the list of combinations, so ideally I can filter these upfront.

I've tried building a tree recursively, where e.g. root node has the full list of choices, child node makes the first choice [1], and has an updated list of choices where '1' is removed from all list[1:] choices.

Struggling to implement the recursion though.

Can you help me with any other approaches?

Acetaldehyde answered 14/9, 2021 at 20:33 Comment(13)
Could you provide a realistic test case of max size?Historiographer
I'd like to apply it to a list of 25 sublists, each of which can have 1-25 elements (integers 1-25 in fact). Realisitically, each sublist will have max 15 elements, with 4-5 on averageAcetaldehyde
It seems that you only care about unique combinations: i.e. the combination taking a single 4 from [2, 4] is the same as the combination taking that single 4 from [4]. Is this correct? It's also important to know how many elements these lists have in common: are they mostly distinct or mostly equal, or is this unknown?Pruritus
Yes that's right. But if my possible choices were [[4], [2,4]], select 4 then nothing from 2nd sublist is not the same as selecting nothing from 1st sublist and 4 from the 2nd - if that makes sense. And to your second question, the elements are drawn only from the integers 0-25, so they won't be mostly distinct.Acetaldehyde
isn't that what this does? docs.python.org/3/library/itertools.html#itertools.combinationsJoiner
@AdamSmooch, I tried with itertools.product, and it does work for enumerating the combinations, but since I want to filter while enumerating rather than afterwards, I don't think I can use itertools in that way.Acetaldehyde
@Acetaldehyde I'm not sure I get it. Can you elaborate on how 4 then Nothing is different than Nothing then 4?Fons
@KyleParsons, since I will use the result to apply as a mask to a grid. I'm trying to find combinations in a 25x25 grid. Each sublist is effectively a set of possible row choices for a single column. Selecting 4 then nothing would mean 'select row 4 from column 0 and make no choice in column 1', whereas nothing then 4 would mean 'make no choice in column 0 and select row 4 from column 1'Acetaldehyde
Gotcha so the return should be basically be an ordered sequence looking something like [None, 2, None, None, 3, ..., 20, None]?Fons
@KyleParsons, yes, exactly. Sorry I didn't make it clearer to start with, appreciate all the clarifying questions. If it helps, I have a preference for combinations that have the fewest None choices. I was trying to think of a way to enumerate from the bottom up, making maximum non-empty choices first. By 'ordered' I mean preserving the position of each choice, rather than actually ordering the result!Acetaldehyde
@Matiiss Sorry my point was that we need to keep track of which indexes we're not picking from.Fons
@Acetaldehyde What are you going to do with the combinations?Historiographer
@don'ttalkjustcode, I will select the 'best' ones, which have the fewest None choices, or some other criteria, and apply it to my grid so that I have a good combination of rows/columns.Acetaldehyde
P
6

Another way to generate all valid outputs with minimal memory usage is to iterate over the elements rather than over the lists. Use a Depth-First search so that you only generate valid outputs from the start. This means that we need to track three things in each level of our DFS: the current element to maybe add, the lists that are already used, and the order that we used those previous lists in.

To help with our search, we preprocess choices by mapping every element to a set of the possible lists it can be in, which create a sort of Dual version of the problem. This also generates the lists in roughly 'maximum non-empty choices first' order.

Since you've specified that the number of unique elements, N, is equal to the number of lists, this approach runs in O(N * |output|) time, and we use generators everywhere to save memory.

import collections
from typing import Dict, Generator, List, Optional, Set

choices = [[1], [2, 4], [4], [5, 6, 2], [5, 3]]
element_to_containing_lists: Dict[int, Set[int]] = collections.defaultdict(set)

for i, choice_list in enumerate(choices):
    for x in choice_list:
        element_to_containing_lists[x].add(i)

all_unique_elements = sorted(element_to_containing_lists)


def dfs(used_list_indices: Set[int],
        next_element_index: int,
        used_list_ordered_indices: List[Optional[int]]) -> Generator[List[Optional[int]]]:
    if next_element_index == len(all_unique_elements):
        yield used_list_ordered_indices
    else:
        # If we use the element, find an unused list index
        for possible_list_to_use in element_to_containing_lists[
                                        all_unique_elements[next_element_index]] - used_list_indices:
            yield from dfs(used_list_indices | {possible_list_to_use},
                           next_element_index + 1,
                           used_list_ordered_indices + [possible_list_to_use])

        # If we don't use the element: Add None as a sentinel value
        yield from dfs(used_list_indices,
                       next_element_index + 1,
                       used_list_ordered_indices + [None])


for element_to_used_list in dfs(set(), 0, []):
    list_to_chosen_element = ['N'] * len(choices)
    for x, y in zip(all_unique_elements, element_to_used_list):
        if y is not None:
            list_to_chosen_element[y] = x
    print(*list_to_chosen_element, sep='  ')

First 10 lines of the output:

1  2  4  5  3
1  2  4  6  3
1  2  4  N  3
1  2  N  5  3
1  2  N  6  3
1  2  N  N  3
1  2  4  5  N
1  2  4  6  5
1  2  4  N  5

This can possibly be optimized to run in O(|output|) time by using a bitmask for 'used lists' rather than a set of their indices.

Pruritus answered 14/9, 2021 at 21:42 Comment(1)
This is clever. Will need a couple of days to digest though. Appreciate itAcetaldehyde
S
5

Don't turn the result into a list. product is a generator. Use that to your advantage. filter is a generator too. You only hold one combination in memory this way. Sometimes a single output of product will be discarded internally without you seeing it, but it won't take up any extra space.

def criterion(x):
    k = [i for i in x if i is not None]
    return len(k) == len(set(k))

choices = [[1], [2, 4], [4], [5, 6, 2], [5, 3]]
for c in choices:
    c.append(None)
filter(criterion, product(*choices))
Sedimentology answered 14/9, 2021 at 21:0 Comment(4)
Looks slow, though. They said average 2-4 elements, you add 1 to each, and 4^25 is over 1e15.Historiographer
This works for the minimal example I gave - appreciate it. I think @don'ttalkjustcode is right though, I am running it on my full example, and it looks like it will take a while. Would I also be able to sort the result, selecting (for example) the 100 combinations with the fewest None choices?Acetaldehyde
@AdZinu. You could do some clever chaining to go through the non-None options first,Sedimentology
@AdZinu. Your dataset is enormous. You can't get away from that no matter what you do. My solution may not be super fast, but it's pretty much guaranteed not to clobber your memory no matter how much stuff you're processing.Sedimentology
P
1

Very similar to the previous answer. Not as fast, but I think it's easier to understand. A simple recursive implementation.

def pick_all(choices):
    # The result on an empty choices list is just a single empty list
    if not choices:
        yield []
        return
    # Pluck off the first choice
    first_choice, *remainder = choices

    # Recursively find all solutions to the smaller problem.
    for result in pick_all(remainder):
        # We can always add None to the front
        yield [None, *result]
        for item in first_choice:
            # And we can add any item in this choice that's not already there.
            if item not in result:
                yield [item, *result]
Peek answered 14/9, 2021 at 21:47 Comment(0)
J
0
[set(dataset) for dataset in itertools.product(*datasets)]

Creates a combinations list of sets. Sets are like lists but automatically remove existing values. This code does not make sure the sets are of length 5. Note sets are also unordered.


print([dataset for dataset in itertools.product(*datasets) if len(set(dataset)) == len(dataset)])

List of combinations and filters out any lists with duplicate values by making sure a set of the data is the same length as the original list.


datasets = [[1], [2, 4], [4], [5, 6, 2], [5, 3]]
datasets = [[*dataset, None] for dataset in datasets]

new_dataset = []
for dataset in itertools.product(*datasets):
    without_nones = [n for n in dataset if n is not None]
    if len(without_nones) == len(set(without_nones)):
        new_dataset.append(dataset)

print(new_dataset)

Creates a list of combinations where the length is always 5, has no repeating numbers, and can have unlimited Nones.

This should probably be a generator, and I suspect there is a easier way to go about whatever you need a list in this format for.

Jingoism answered 14/9, 2021 at 21:37 Comment(3)
Wouldn't for dataset in itertool.product(*datasets) be enumerating first then filtering? So for a much larger original datasets list, be computationally expensive? MadPhysicist's answer did also use sets to check for duplicatesAcetaldehyde
@Acetaldehyde Yes, it would. However if you allow an unlimited amount of Nones in each sublist then the operation is overly complicated for a list comp or generator expression. In Python performance shouldn't be your first priority and if you are that worried about speed (such as with larger sets) then you should absolutely look into modules like numpy. I'm also curious about what you need 25^25 long lists for.Jingoism
More like 4^25, since not every sublist has the full 25 elements. It could well be the way I've formulated the question is the wrong approach entirely. Will need a couple of days to digest the answers though. Appreciate the help from everyoneAcetaldehyde

© 2022 - 2024 — McMap. All rights reserved.