Randomly extract x items from a list using Python

Asked 4/5, 2014 at 17:26 Answered 4/5, 2014 at 17:45

Solved python list random indices python-internals

Starting with two lists such as:

lstOne = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
lstTwo = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

I want the user to enter how many items to extract, as a percentage of the overall list length, and then have the same randomly chosen indices extracted from each list. For example, if I wanted 50%, the output would be:

newLstOne = ['8', '1', '3', '7', '5']
newLstTwo = ['8', '1', '3', '7', '5']

I have achieved this using the following code:

from random import randrange

lstOne = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
lstTwo = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

LengthOfList = len(lstOne)
print LengthOfList

PercentageToUse = input("What Percentage Of Reads Do you want to extract? ")
RangeOfListIndices = []

HowManyIndicesToMake = (float(PercentageToUse)/100)*float(LengthOfList)
print HowManyIndicesToMake

for x in lstOne:
    if len(RangeOfListIndices)==int(HowManyIndicesToMake):
        break
    else:
        random_index = randrange(0,LengthOfList)
        RangeOfListIndices.append(random_index)

print RangeOfListIndices


newlstOne = []
newlstTwo = []

for x in RangeOfListIndices:
    newlstOne.append(lstOne[int(x)])
for x in RangeOfListIndices:
    newlstTwo.append(lstTwo[int(x)])

print newlstOne
print newlstTwo

But I was wondering whether there is a more efficient way of doing this, because my actual use case involves subsampling from 145,000 items. Furthermore, is randrange sufficiently free of bias at this scale?

Thank you

Apiece answered 4/5, 2014 at 17:26 Comment(2)

@devnull You are far too aggressive about marking questions as possible duplicates. The other question asks "how do I make a random sample". This question asks two far more interesting questions, "how do I make the same sample from multiple lists" and "are the built-in randomization functions biased". – Rahmann 4/5, 2014 at 18:9

@RaymondHettinger How could I argue having watched one of your Python videos earlier during the day? (Close vote retracted.) – Wendalyn 5/5, 2014 at 2:5

Q. I want to have the user input how many items they want to extract, as a percentage of the overall list length, and the same indices from each list to be randomly extracted.

A. The most straightforward approach directly matches your specification:

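import random

percentage = float(raw_input('What percentage? '))
# Round to the nearest whole item; random.sample() needs an integer count.
k = int(round(len(list1) * percentage / 100))
# Draw k distinct indices once, then reuse them for both lists.
indices = random.sample(xrange(len(list1)), k)
new_list1 = [list1[i] for i in indices]
new_list2 = [list2[i] for i in indices]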
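For readers on Python 3, the same idea can be wrapped in a small helper. This is only a sketch, not part of the original answer: the name `sample_parallel` and its `rng` parameter are hypothetical, it assumes every list has the same length, and it truncates the item count toward zero as a design choice.

```python
import random


def sample_parallel(percentage, *lists, rng=random):
    """Pick the same randomly chosen positions from every list."""
    n = len(lists[0])
    if any(len(lst) != n for lst in lists):
        raise ValueError("all lists must have the same length")
    # Truncate toward zero to get a whole number of items (sketch choice).
    k = int(n * percentage // 100)
    # range() is a lazy sequence in Python 3, so no index list is built.
    indices = rng.sample(range(n), k)
    return [[lst[i] for i in indices] for lst in lists]


lst_one = [str(i) for i in range(1, 11)]
lst_two = list("abcdefghij")
new_one, new_two = sample_parallel(50, lst_one, lst_two)
```

Passing `rng=random.Random(seed)` should make the selection reproducible, which can be handy when subsampling a large data set more than once.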

Q. in my actual use case this is subsampling from 145,000 items. Furthermore, is randrange sufficiently free of bias at this scale?

A. In Python 2 and Python 3, the random.randrange() function completely eliminates bias (it uses the internal _randbelow() method that makes multiple random choices until a bias-free result is found).

In Python 2, the random.sample() function is slightly biased but only in the round-off in the last of 53 bits. In Python 3, the random.sample() function uses the internal _randbelow() method and is bias-free.
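To illustrate what "makes multiple random choices until a bias-free result is found" can look like, here is a minimal Python 3 sketch of rejection sampling. It is an illustration of the general technique, not CPython's actual source; the function name `randbelow_rejection` and its `getrandbits` parameter are hypothetical.

```python
import random


def randbelow_rejection(n, getrandbits=random.getrandbits):
    """Return a uniformly distributed integer in range(n), for n > 0."""
    if n <= 0:
        raise ValueError("n must be positive")
    k = n.bit_length()          # enough bits to represent n
    r = getrandbits(k)          # uniform over 0 .. 2**k - 1
    while r >= n:               # out of range: discard and draw again
        r = getrandbits(k)
    return r


index = randbelow_rejection(145000)
```

Every accepted value is equally likely because out-of-range draws are thrown away rather than folded back into the range. By contrast, scaling a 53-bit float with `int(random.random() * n)` cannot split the 2**53 possible floats evenly into `n` buckets unless `n` is a power of two, which is where a tiny round-off bias can come from.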

Rahmann answered 4/5, 2014 at 17:45 Comment(4)

Thanks for your thorough answer. One problem I have in this code is that you can't input values such as 12.5 percent and get the code to round to the nearest value. How would you implement this in your example? – Apiece 4/5, 2014 at 19:15

Just for clarification, I don't mean rounding the percentage value. I mean that if you had 1300 items and wanted 12.5% of them, the code would return 163 items (12.5% is 162.5 items), not 169 items (which is what you get if it rounds the percentage up to 13%).

@Apiece No worries. I just changed the int conversion to a float conversion. – Rahmann 4/5, 2014 at 23:48

There was still a problem as the index was a float not an integer so I just added in k = round(k) and k = int(k) to round it up. Thanks for the help! – Apiece 5/5, 2014 at 7:43
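The rounding discussed in these comments can be worked through with the 1300-item example. The sketch below assumes Python 3, where the built-in `round()` sends exact halves to the nearest even integer, so a separate half-up helper is shown; the name `items_for_percentage` is hypothetical.

```python
import math


def items_for_percentage(n_items, percentage):
    """Number of items for a percentage, with halves rounded up."""
    exact = n_items * percentage / 100.0
    return int(math.floor(exact + 0.5))


print(items_for_percentage(1300, 12.5))   # exact value is 162.5
print(round(1300 * 12.5 / 100))           # Python 3 rounds the tie to even
```

The first line prints 163 and the second prints 162, so under Python 3 the choice of rounding rule changes the sample size by one item in this case. The `floor(x + 0.5)` trick is adequate for a sketch like this, though it can misbehave for values a hair below a half.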

Just zip your two lists together, use random.sample to do your sampling, then zip again to transpose back into two lists.

import random

_zips = random.sample(zip(lstOne,lstTwo), 5)

new_list_1, new_list_2 = zip(*_zips)

demo:

list_1 = range(1,11)
list_2 = list('abcdefghij')

_zips = random.sample(zip(list_1, list_2), 5)

new_list_1, new_list_2 = zip(*_zips)

new_list_1
Out[33]: (3, 1, 9, 8, 10)

new_list_2
Out[34]: ('c', 'a', 'i', 'h', 'j')
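The demo above relies on Python 2, where `zip()` and `range()` return lists. A Python 3 sketch of the same zip, sample, unzip steps might look like this; the names `pairs` and `picked` are illustrative only.

```python
import random

list_1 = list(range(1, 11))
list_2 = list("abcdefghij")

# random.sample() needs a sequence, so materialise the pairs first.
pairs = list(zip(list_1, list_2))
picked = random.sample(pairs, 5)

# zip(*picked) transposes the sampled pairs back into two tuples.
new_list_1, new_list_2 = zip(*picked)
```

Note that `list(zip(...))` builds one tuple per element of the population, so memory use grows with the size of the full lists rather than with the size of the sample.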

Proclivity answered 4/5, 2014 at 17:34 Comment(2)

This is a pretty way to do it, but I can't upvote it because it does too much work (looping over the entire population and saving a tuple for each pair). It is better to build a small list of unique indices and extract the desired selections. – Rahmann 4/5, 2014 at 18:40

No disagreements here :-) – Proclivity 4/5, 2014 at 18:49

The way you are doing it looks mostly okay to me.

If you want to avoid sampling the same object several times, you could proceed as follows:

import random

k = 5                                 # number of items wanted (example value)
choose_from = range(len(lstOne))      # list of every index into lstOne (Python 2)
random.shuffle(choose_from)
newlstOne, newlstTwo = [], []
for i in choose_from[:k]:             # first k shuffled indices, so no repeats
    newlstOne.append(lstOne[i])       # take the same random positions
    newlstTwo.append(lstTwo[i])       # from both original lists

Indentation answered 4/5, 2014 at 17:44 Comment(2)

This does way too much work for large population sizes. The random.sample() function uses much less memory and makes fewer calls to the random number generator. – Rahmann 4/5, 2014 at 17:57

Thank you kind Sir, you are of course correct. I did not know about random.sample; I learn something every time you post. – Indentation 5/5, 2014 at 0:4
