Randomly extract x items from a list using python

3

9

Starting with two lists such as:

lstOne = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
lstTwo = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

I want the user to input how many items to extract, as a percentage of the overall list length, and have the same indices randomly extracted from each list. For example, if I wanted 50%, the output would be:

newLstOne = ['8', '1', '3', '7', '5']
newLstTwo = ['8', '1', '3', '7', '5']

I have achieved this using the following code:

from random import randrange

lstOne = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
lstTwo = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

LengthOfList = len(lstOne)
print LengthOfList

PercentageToUse = input("What Percentage Of Reads Do you want to extract? ")
RangeOfListIndices = []

HowManyIndicesToMake = (float(PercentageToUse)/100)*float(LengthOfList)
print HowManyIndicesToMake

for x in lstOne:
    if len(RangeOfListIndices)==int(HowManyIndicesToMake):
        break
    else:
        random_index = randrange(0,LengthOfList)
        RangeOfListIndices.append(random_index)

print RangeOfListIndices


newlstOne = []
newlstTwo = []

for x in RangeOfListIndices:
    newlstOne.append(lstOne[int(x)])
for x in RangeOfListIndices:
    newlstTwo.append(lstTwo[int(x)])

print newlstOne
print newlstTwo

But I was wondering if there is a more efficient way of doing this; in my actual use case, this is subsampling from 145,000 items. Furthermore, is randrange sufficiently free of bias at this scale?

Thank you

Apiece answered 4/5, 2014 at 17:26 Comment(2)
@devnull You are far too aggressive about marking questions as possible duplicates. The other question asks "how do I make a random sample". This question asks two far more interesting questions, "how do I make the same sample from multiple lists" and "are the built-in randomization functions biased".Rahmann
@RaymondHettinger How could I argue having watched one of your Python videos earlier during the day? (Close vote retracted.)Wendalyn
14

Q. I want to have the user input how many items they want to extract, as a percentage of the overall list length, and the same indices from each list to be randomly extracted.

A. The most straightforward approach directly matches your specification:

import random

percentage = float(raw_input('What percentage? '))
# Round to the nearest whole item; random.sample() needs an integer k
k = int(round(len(list1) * percentage / 100))
indices = random.sample(xrange(len(list1)), k)
new_list1 = [list1[i] for i in indices]
new_list2 = [list2[i] for i in indices]
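The same idea can be wrapped in a small function. The following is a minimal Python 3 sketch, not part of the original answer: the names sample_parallel and rng are hypothetical, and it assumes both lists have the same length. The item count is rounded half up to the nearest whole item, so 12.5% of 1300 items gives 163.

```python
import math
import random


def sample_parallel(list1, list2, percentage, rng=random):
    """Return the items at the same k random positions of both lists.

    sample_parallel and rng are hypothetical names for this sketch;
    rng can be a seeded random.Random for reproducible runs.
    """
    if len(list1) != len(list2):
        raise ValueError("both lists must have the same length")
    n = len(list1)
    # Round half up to the nearest whole item, e.g. 12.5% of 1300 -> 163.
    k = math.floor(n * percentage / 100 + 0.5)
    # Clamp so out-of-range percentages cannot break the sample() call.
    k = max(0, min(n, k))
    # Draw k unique indices without copying the whole population.
    indices = rng.sample(range(n), k)
    return [list1[i] for i in indices], [list2[i] for i in indices]


lst_one = [str(i) for i in range(1, 11)]
lst_two = [str(i) for i in range(1, 11)]
new_one, new_two = sample_parallel(lst_one, lst_two, 50, random.Random(0))
print(new_one)
print(new_two)
```

Passing a seeded random.Random instance makes a run repeatable, which can help when the subsample has to be reproduced later.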

Q. in my actual use case this is subsampling from 145,000 items. Furthermore, is randrange sufficiently free of bias at this scale?

A. In Python 2 and Python 3, the random.randrange() function completely eliminates bias (it uses the internal _randbelow() method that makes multiple random choices until a bias-free result is found).

In Python 2, the random.sample() function is slightly biased but only in the round-off in the last of 53 bits. In Python 3, the random.sample() function uses the internal _randbelow() method and is bias-free.
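The "multiple random choices until a bias-free result is found" idea is rejection sampling. The block below is a rough Python 3 sketch of that idea only; randbelow_sketch is a hypothetical name, and the code is an illustration rather than a copy of the standard library's implementation.

```python
import random


def randbelow_sketch(n, getrandbits=random.getrandbits):
    """Return a uniform integer in [0, n) by rejection sampling.

    Hypothetical helper for illustration; not the library's own code.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    k = n.bit_length()      # k random bits cover 0 .. 2**k - 1, and 2**k > n
    r = getrandbits(k)
    while r >= n:           # discard out-of-range draws rather than fold them back
        r = getrandbits(k)
    return r


# The naive alternative, getrandbits(k) % n, would make small values
# slightly more likely whenever 2**k is not a multiple of n.
counts = [0] * 3
for _ in range(30000):
    counts[randbelow_sketch(3)] += 1
print(counts)  # three roughly equal counts
```

Because every accepted value comes from the same number of equally likely bit patterns, no value in the range is favoured, whatever the size of n.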

Rahmann answered 4/5, 2014 at 17:45 Comment(4)
Thanks for your thorough answer. One problem I have in this code is that you can't input values such as 12.5 percent and get the code to round to the nearest value. How would you implement this in your example?Apiece
Just for clarification, I don't mean rounding the percentage value: I mean that if you had 1300 items and wanted 12.5% of them, the code would return 163 items (12.5% is 162.5 items), not 169 items (as it would if it rounded the percentage up to 13%).Apiece
@Apiece No worries. I just changed the int conversion to a float conversion.Rahmann
There was still a problem as the index was a float not an integer so I just added in k = round(k) and k = int(k) to round it up. Thanks for the help!Apiece
1

Just zip your two lists together, use random.sample to do your sampling, then zip again to transpose back into two lists.

import random

_zips = random.sample(zip(lstOne,lstTwo), 5)

new_list_1, new_list_2 = zip(*_zips)

demo:

list_1 = range(1,11)
list_2 = list('abcdefghij')

_zips = random.sample(zip(list_1, list_2), 5)

new_list_1, new_list_2 = zip(*_zips)

new_list_1
Out[33]: (3, 1, 9, 8, 10)

new_list_2
Out[34]: ('c', 'a', 'i', 'h', 'j')
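The snippets above rely on Python 2, where zip() and range() return lists. A Python 3 version of the same approach might look like the following sketch, assuming the sample size of 5 from the demo; the names pairs and chosen are introduced here for illustration.

```python
import random

list_1 = list(range(1, 11))
list_2 = list("abcdefghij")

# zip() is lazy in Python 3, so build the list of pairs explicitly.
pairs = list(zip(list_1, list_2))
chosen = random.sample(pairs, 5)

# Transpose the sampled pairs back into two tuples.
new_list_1, new_list_2 = zip(*chosen)
print(new_list_1)
print(new_list_2)
```

Each sampled pair keeps its two elements together, so position i of new_list_1 always corresponds to position i of new_list_2.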
Proclivity answered 4/5, 2014 at 17:34 Comment(2)
This is a pretty way to do it, but I can't upvote it because it does too much work (looping over the entire population and saving a tuple for each pair). It is better to build a small list of unique indices and extract the desired selections.Rahmann
No disagreements here :-)Proclivity
1

The way you are doing it looks mostly okay to me.

If you want to avoid sampling the same object several times, you could proceed as follows:

import random

n = len(lstOne)
k = n // 2                      # number of items to extract; 50% assumed here for illustration
choose_from = range(n)          # list of every index into lstOne
random.shuffle(choose_from)     # shuffle the indices in place
newlstOne, newlstTwo = [], []
for i in choose_from[:k]:       # the first k shuffled indices are all distinct
    newlstOne.append(lstOne[i]) # take the items at the same random positions
    newlstTwo.append(lstTwo[i]) # from both original lists
Indentation answered 4/5, 2014 at 17:44 Comment(2)
This does way too much work for large population sizes. The random.sample() function uses much less memory and makes fewer calls to the random number generator.Rahmann
Thank you kind Sir, you are of course correct. I did not know about random.sample; I learn something every time you post.Indentation
