How to sample from Cartesian product without repetition
Asked Answered
A

7

5

I have a list of sets, and I wish to sample n different samples each containing an item from each set. What I do not want is to have it in order, so, for example, I will get all the samples necessarily with the same item from the first set. I also don't want to create all the Cartesian products as that might not be possible in terms of efficiency... Any idea of how to do it? Or even something to approximate this behaviour?

Example that does not work:

(prod for i, prod in zip(range(n), itertools.product(*list_of_sets)))
Apothem answered 8/2, 2018 at 13:34 Comment(5)
Say your first two sets look like {1, 2, 3} and {2}. You randomly pick 2 from {1, 2, 3}. What do you randomly pick from {2}?Pericarp
you "randomly" pick 2. of course :-)Apothem
list(map(random.choice, map(list, list_of_sets))) will generate such a sample, doing it repeatedly will not avoid repetitions, though.Amberjack
@schwobaseggl It is clean and readable, but probably the inner map should be taken out and saved if done multiple times, for better efficiency.Apothem
@Apothem Yup, you'd absolutely store the the list of lists if done repeatedly. Didn't want to clutter the comments :)Amberjack
K
9

All the above solutions waste a lot of resources for filtering repeated results when it comes to the end of the iteration. That's why I have thought of a method that has (almost) linear speed from start until the very end.

The idea is: Give (only in your head) each result of the standard order cartesian product an index. That would be for example for AxBxC with 2000x1x2 = 4000 elements:

0: (A[0], B[0], C[0])
1: (A[1], B[0], C[0])
...
1999: (A[1999], B[0], C[0])
2000: (A[0], B[0], C[1])
...
3999: (A[1999], B[0], C[1])
done.

So there are still some questions open:

  • How do I get a list of possible indices? Answer: Just multiply 2000*1*2=4000 and every number below that will be a valid index.
  • How do I generate random indices sequentially without repetition? There are two answers: If you want samples with a known sample size n, just use random.sample(xrange(numer_of_indices), n). But if you don't know the sample size yet (more general case), you have to generate indices on the fly to not waste memory. In that case, you can just generate index = random.randint(0, k - 1) with k = numer_of_indices to get the first index and k = number_of_indices - n for the nth result. Just check my code below (be aware, that I use a one sided linked list there to store the done indices. It makes insert operations O(1) operations and we need a lot of insertions here).
  • How do I generate the output from the index? Answer: Well, say our index is i. Then i % 2000 will be the index of A for the result. Now i // 2000 can be treated recursively as the index for the cartesian product of the remaining factors.

So this is the code I came up with:

def random_order_cartesian_product(*factors):
    amount = functools.reduce(lambda prod, factor: prod * len(factor), factors, 1)
    index_linked_list = [None, None]
    for max_index in reversed(range(amount)):
        index = random.randint(0, max_index)
        index_link = index_linked_list
        while index_link[1] is not None and index_link[1][0] <= index:
            index += 1
            index_link = index_link[1]
        index_link[1] = [index, index_link[1]]
        items = []
        for factor in factors:
            items.append(factor[index % len(factor)])
            index //= len(factor)
        yield items
Knop answered 22/12, 2018 at 12:13 Comment(5)
Valid! Remove '*' in "def random_order_cartesian_product(*factors):" if your "factors" is a list etcImprovisator
randint is not guaranteed to be periodic over k, you will likely see repetition before enumerating all possibilities.Angelitaangell
@ktb This phenomenon may make the results "predictable" under special conditions but truly random results are basically impossible either way. If true randomness is important to someone, they should use a different function than randint in place.Knop
Well its more an issue of not getting repeats. I actually recently solved this by using the approach in this answer and a linear congruential generator (can create a full arbitrary-period random iteration) instead of randint.Angelitaangell
I get the vibe that by "repeat" you actually mean "the same permutation, more than once". This does not happen though. Do I get you wrong?Knop
D
2

You can use sample from the random lib:

import random
[[random.sample(x,1)[0] for x in list_of_sets] for _ in range(n)]

for example:

list_of_sets = [{1,2,3}, {4,5,6}, {1,4,7}]
n = 3

A possible output will be:

[[2, 4, 7], [1, 4, 7], [1, 6, 1]]

EDIT:

If we want to avoid repetitions we can use a while loop and collect the results to a set. In addition you can check that n is valid and return the Cartesian product for invalid n values:

chosen = set()
if 0 < n < reduce(lambda a,b: a*b,[len(x) for x in list_of_sets]):
    while len(chosen) < n:
        chosen.add(tuple([random.sample(x,1)[0] for x in list_of_sets]))
else:
    chosen = itertools.product(*list_of_sets)
Defunct answered 8/2, 2018 at 13:44 Comment(5)
Nice edit, but note that currently the code might never stop if n is too large.Apothem
@Apothem I assumed n is smaller the the Cartesian productsDefunct
@Defunct sample(x, 1) is the same as choice(x)Amberjack
@schwobaseggl yes. However, choice does not work with sets and I wanted to avoid mapping the sets into other collection typesDefunct
@Apothem - I edited my answer to avoid endless while loopsDefunct
A
1

The following generator function generates non-repetitive samples. It will only work performantly if the number of samples generated is much smaller than the number of possible samples. It also requires the elements of the sets to be hashable:

def samples(list_of_sets):
    list_of_lists = list(map(list, list_of_sets))  # choice only works on sequences
    seen = set()  # keep track of seen samples
    while True:
        x = tuple(map(random.choice, list_of_lists))  # tuple is hashable
        if x not in seen:
            seen.add(x)
            yield x

>>> lst = [{'b', 'a'}, {'c', 'd'}, {'f', 'e'}, {'g', 'h'}]
>>> gen = samples(lst)
>>> next(gen)
('b', 'c', 'f', 'g')
>>> next(gen)
('a', 'c', 'e', 'g')
>>> next(gen)
('b', 'd', 'f', 'h')
>>> next(gen)
('a', 'c', 'f', 'g')
Amberjack answered 8/2, 2018 at 14:12 Comment(0)
I
1

Here is a complete version with an example and some modification for easy understanding and easy use:

import functools
import random

def random_order_cartesian_product(factors):
    amount = functools.reduce(lambda prod, factor: prod * len(factor), factors, 1)
    print(amount)
    print(len(factors[0]))
    index_linked_list = [None, None]
    for max_index in reversed(range(amount)):
        index = random.randint(0, max_index)
        index_link = index_linked_list
        while index_link[1] is not None and index_link[1][0] <= index:
            index += 1
            index_link = index_link[1]
        index_link[1] = [index, index_link[1]]
        items = []
        for factor in factors:
            items.append(factor[index % len(factor)])
            index //= len(factor)
        yield items
        

factors=[
    [1,2,3],
    [4,5,6],
    [7,8,9]
]

n = 5

all = random_order_cartesian_product(factors)

count = 0

for comb in all:
  print(comb)
  count += 1
  if count == n:
    break
Improvisator answered 27/6, 2019 at 5:16 Comment(0)
U
1

We create a sequence using a prime number and one of its primitive roots modulo n that visits each number in an interval exactly once. More specifically we are looking for a generator of the multiplicative group of integers modulo n.

We have to pick our prime number a little larger than the product prod([len(i) for i in iterables)], so we have to account for the cases in which we'd get index errors.

import random
from math import gcd
import math

from math import prod
from typing import Iterable


def next_prime(number):
    if number < 0:
        raise ValueError('Negative numbers can not be primes')
    if number <= 1:
        return 2
    if number % 2 == 0:
        number -= 1
    while True:
        number += 2
        max_check = int(math.sqrt(number)) + 2
        for divider in range(3, max_check, 2):
            if number % divider == 0:
                break
        else:
            return number


def is_primitive_root(a, n):
    phi = n - 1
    factors = set()
    for i in range(2, int(phi ** 0.5) + 1):
        if phi % i == 0:
            factors.add(i)
            factors.add(phi // i)
    for factor in factors:
        if pow(a, factor, n) == 1:
            return False
    return True


def find_random_primitive_root(n):
    while True:
        a = random.randint(2, n - 1)
        if gcd(a, n) == 1 and is_primitive_root(a, n):
            return a


class CoordinateMapper:
    """
        A class that maps linear indices to multi-dimensional coordinates within a specified shape.

        Args:
            dims (Iterable[int]): An iterable representing the dimensions of the desired shape.

        Example Usage:
            shape = (2, 3, 5, 4)
            mapper = CoordinateMapper(shape)
            coordinates = mapper.map(10)
            print(coordinates)  # Output: [0, 1, 2, 2]
    """

    def __init__(self, dims: Iterable[int]):
        self.moduli = [prod(dims[i:]) for i in range(len(dims))]
        self.divisors = [prod(dims[i + 1:]) for i in range(len(dims))]

    def map(self, n: int):
        return [(n % self.moduli[i]) // self.divisors[i] for i in range(len(self.moduli))]


def sampler(l):
    close_prime = next_prime(l)
    state = root = find_random_primitive_root(close_prime)
    while state > l:
        state = (state * root) % close_prime  # Inlining the computation leads to a 20% speed up

    yield state - 1
    for i in range(l - 1):
        state = (state * root) % close_prime
        while state > l:
            state = (state * root) % close_prime
        yield state - 1


def _unique_combinations(*iterables):
    cartesian_product_cardinality = prod([len(i) for i in iterables])
    coordinate_mapper = CoordinateMapper([len(i) for i in iterables])
    sequence = sampler(cartesian_product_cardinality)
    for state in sequence:
        yield tuple(iterable[coord] for iterable, coord in zip(iterables, coordinate_mapper.map(state)))

from itertools import product

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]
u = _unique_combinations(a, b)

assert sorted(u) == sorted(product(a, b))

I started benchmarking the various approaches. I couldn't get any solution other than @matmarbon's to run without assertion error:

from itertools import product
import time
approaches= {
    'prime_roots':_unique_combinations,
    'matmarbon':random_order_cartesian_product,
    'itertools.product':itertools.product,
}
a = list(range(10))
b = list(range(10))
for name, approach in approaches.items():
    assert sorted(u)==sorted(product(a,b))

For the 2 algorithms I benchmarked the following, with itertools as baseline

import pandas as pd
import timeit
import matplotlib.pyplot as plt

def benchmark(approaches, list_lengths, num_repetitions):
    results = []

    for approach, function in approaches.items():
        for length in list_lengths:
            a = list(range(length))
            b = list(range(length))
            def f():
                for item in function(a,b):
                    pass
            execution_time = timeit.timeit(f, number=num_repetitions)
            entry = {
                'Approach': approach,
                'List Length': length,
                'Execution Time': execution_time
            }
            print(entry)
            results.append(entry)

    results_df = pd.DataFrame(results)

    # Plot the benchmark results
    plt.figure(figsize=(10, 6))
    for approach in approaches.keys():
        approach_results = results_df[results_df['Approach'] == approach]
        plt.plot(approach_results['List Length'], approach_results['Execution Time'], marker='o', label=approach)

    plt.xlabel('List Length')
    plt.ylabel('Execution Time (seconds)')
    plt.title('Benchmark Results')
    plt.grid(True)
    plt.legend()
    plt.show()

list_lengths = [10,20,30,40,50,60,70,80,90,100]
num_repetitions = 3
benchmark(approaches, list_lengths, num_repetitions)

enter image description here It appears that @matmarbon's algorithm, while correct is in O(k^n). The prime roots performs in O(n^k) for k~len(iterables) (assuming somewhat evenly sized iterables)

Memory-wise the prime roots approach wins just because only O(1) memory is required and nothing is stored.

Distribution wise the prime roots approach is not actually random but rather a difficult-to-predict-deterministic sequence. In practice the sequences should be "random" enough.

Credit to this stack overflow answer which inspired the solution.

Ubiquitarian answered 22/6, 2023 at 22:59 Comment(0)
A
0

As I want no repetition, and sometimes it is not possible the code is not that short. But as @andreyF said, random.sample does the work. Perhaps there is also a better way that avoids resampling with repetition until enough non repetitive ones exist, this is the best I have so far.

import operator
import random
def get_cart_product(list_of_sets, n=None):
    max_products_num = reduce(operator.mul, [len(cluster) for cluster in list_of_sets], 1)
    if n is not None and n < max_products_num:
        refs = set()
        while len(refs) < n:
            refs.add(tuple(random.sample(cluster, 1)[0] for cluster in list_of_sets))
        return refs
        return (prod for i, prod in zip(range(n), itertools.product(*list_of_sets)))
    return itertools.product(*list_of_sets)

Note that the code assumes a list of frozen sets, a conversion of random.sample(cluster, 1)[0] should be done otherwise.

Apothem answered 8/2, 2018 at 13:58 Comment(0)
T
0

If your number of random samples is much smaller than the size of your cartesian product, a fast solution is just to maintain n-tuple of indexes which you have already returned.

This has extremely low memory usage (grows linearly in terms of the number of elements to return).

It might become slow if n grows very large, but for less than 1,000 elements, it runs in <1 ms on my system.

def random_cartesian_product(*lists, seed: Optional[int] = None, n: int):
    rnd = random.Random(seed)
    cartesian_idxs: Set[Tuple[int, ...]] = set()
    list_lens: List[int] = [len(l) for l in lists]
    max_count: int = 1
    for l_len in list_lens:
        max_count *= l_len
    if max_count < n:
        raise ValueError(f'At most {max_count} cartesian product elements can be created.')
    while len(cartesian_idxs) < n:
        rnd_idx: Tuple[int, ...] = tuple(
            rnd.randint(0, l_len - 1)
            for l_len in list_lens
        )
        if rnd_idx not in cartesian_idxs:
            cartesian_idxs.add(rnd_idx)
            elem = []
            for l_idx, l in zip(rnd_idx, lists):
                elem.append(l[l_idx])
            yield elem

This can be used iteratively or all-at-once:

list(random_sample_cartesian_product(['a', 'b', 'c'], [1,2], ['x'], n=6, seed=None))

[['b', 1, 'x'],
 ['a', 1, 'x'],
 ['a', 2, 'x'],
 ['c', 1, 'x'],
 ['c', 2, 'x'],
 ['b', 2, 'x']]
Toby answered 2/11, 2023 at 6:57 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.