Randomly sampling from large combination generator
Asked Answered
E

1

9

At a high level, I'm trying to sample n_samples items across all combinations of n items from a list. At small values of n and relatively small list lengths (n <= 5, len(list) < 75) this is fine - I just use itertools to generate combinations, convert to a list, and randomly sample the correct number using random.sample.

However, my use case requires that I generate the combinations, randomly sample several thousand elements, and then remove one of the combinations from the list and start again with the smaller list.

This creates a problem at high values of n and the len(list) - with 120 list items and n = 5, this usecase means that I have to do list conversion many times and I thus become time constrained by generator --> list conversion for a generator with ~190 million items. This takes an extremely long time (in excess of 20 minutes for especially bad examples).

The usecase doesn't require statistically uniform samples or anything, and I am purely using sampling because with high n and long lists processing every possible combination is computationally impractical and fast processing is extremely important.

I switched to using the iterator.islice method to only take the first n_samples items from the generator and use those. That dramatically increases speed (the example which took 20 minutes now takes 34 seconds), but performance is taking a hit. I think this is due to how itertools generates combinations - as an example,

list(itertools.combinations(list(range(4)), 2))

produces this list: [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

so it seems that if I have a long enough list and a large enough n, sampling even 100,000+ items by just pulling them off the generator will result in 100,000+ items where the first element is the same which is not ideal. As I said, I don't need perfect random sampling, but I think my performance crash from using this method instead of randomly sampling across the whole list is due to this.

Basically, I need a good way to efficiently sample n_samples items (where n_samples is somewhere from 10k to 500k) from all possible combinations of length n (where n is typically in a range of around 2-8) from a list of length which can vary from ~20 to ~200.

Thanks so much for any advice or resources you can provide!

Excruciating answered 30/4, 2019 at 17:45 Comment(1)
Welcome to stackoverflow! There are some interesting examples in this question regarding sampling directly from a generator without collecting it to a list, that may save you some processing timeSap
P
8

From what you describe, I believe that you'd have a much more effective algorithm if you pick each component randomly, independent of the others, and continue until you have the requisite sample. RNGs (random number generators) are quite fast, enough to make up for needing to replace the occasional duplicate. Store your chosen combinations as a set of tuples (hashable), and you can look up set inclusion in constant time, making the collection linear time. Something like this:

from random import randint

# For illustration, the "lsits" include letters, symbols, 3-letter words, and low primes
list1 = "pythonic"
list2 = "~!@#$%^&*()"
list3 = ["dog", "cat", "ape", "red", "cwm", "pox"]
list4 = [2, 3, 5, 7, 11, 13, 17, 19]

combo = [list1, list2, list3, list4]
my_sample = set()
needed_size = 10

while len(my_sample) < needed_size:
    # Choose one random item from each list; that forms an element
    elem = tuple([comp[randint(0, len(comp)-1)] for comp in combo])
    # Using a set elminates duplicates easily
    my_sample.add(elem)

print(my_sample)

Output:

{('h', '$', 'pox', 7),
 ('y', '(', 'cat', 11),
 ('n', '@', 'cat', 7),
 ('i', '^', 'ape', 13),
 ('y', '#', 'pox', 11),
 ('o', '%', 'dog', 7),
 ('p', '^', 'cwm', 13),
 ('c', '*', 'dog', 19),
 ('o', ')', 'pox', 11),
 ('h', '~', 'cat', 5)}

Another possibility is to generate one random number in the range of the product of the lengths (8 * 10 * 6 * 8 in this case), and then use integer division and mod to break that into your four random indices.

One more possibility is to simply generate your first set of random indices, and then increment those as you see fit, stepping through the lists in turn. You will want your list lengths to be pairwise relative prime in this case; you can guarantee that by adding a None element as needed. Any combination with a None is discarded.

Do those ideas get you moving?

Pelagic answered 30/4, 2019 at 21:57 Comment(1)
This is perfect - I can't believe I hadn't thought of that approach! Thanks so much for your help!Excruciating

© 2022 - 2024 — McMap. All rights reserved.