Python pandas: conditionally select a uniform sample from a dataframe

Say I have a dataframe like this:

category1  category2   other_col   another_col ....
a          1
a          2
a          2        
a          3
a          3
a          1
b          10
b          10
b          10
b          11
b          11
b          11

I want to obtain a sample from my dataframe so that each value of category1 appears a uniform number of times. I'm assuming that there are an equal number of each type in category1. I know that this can be done with pandas using DataFrame.sample(). However, I also want to ensure that the sample I select has category2 equally represented as well. So, for example, if I have a sample size of 5, I would want something such as:

a  1
a  2
b  10
b  11
b  10

I would not want something such as:

a 1
a 1
b 10
b 10
b 10

While this is a valid random sample of n=5, it would not meet my requirements, as I want to vary the types of category2 as much as possible.

Notice that in the first example, because a was only sampled twice, 3 was not represented from category2. This is okay. The goal is simply to represent the sample data as uniformly as possible.

If it helps to provide a clearer example, one could think of having the categories fruit, vegetables, meat, grains, junk. In a sample size of 10, I would want each category represented as much as possible, so ideally 2 of each. Then each of those 2 selected rows belonging to a chosen category would have subcategories that are also represented as uniformly as possible. So, for example, fruit could have subcategories of red_fruits, yellow_fruits, etc. For the 2 fruit rows selected of the 10, red_fruits and yellow_fruits would both be represented in the sample. Of course, if we had a larger sample size, we would include more of the subcategories of fruit (green_fruits, blue_fruits, etc.).

Royden answered 12/9, 2016 at 19:49 Comment(4)
what about df.drop_duplicates(subset=['category1','category2']).sample(n=4)?Uncharitable
@MaxU IIUC, that would create a bias where the first row matching a specific pair of values is the one chosen. I'm not sure it's what the OP wants.Beiderbecke
hmmm... it's an interesting question...Uncharitable
@MaxU I was thinking that it's possible to build on your excellent comment with df.reindex(np.random.permutation(df.index)).drop_duplicates(subset=['Category1','Category2']).sample(n=4), but even that's a problem - what if the sample size is such that, say, 2 samples are needed from some pair? The drop_duplicates makes this impossible.Beiderbecke
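
For reference, a minimal sketch of the shuffle-then-deduplicate idea from the last comment, using a hypothetical frame built from the question's data (with the caveat noted above that no (category1, category2) pair can ever appear twice):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category1': list('aaaaaabbbbbb'),
    'category2': [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11],
})

# Shuffle the rows first so drop_duplicates keeps a random row per
# (category1, category2) pair, then sample from the deduplicated frame
shuffled = df.reindex(np.random.permutation(df.index))
print(shuffled.drop_duplicates(subset=['category1', 'category2']).sample(n=4))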

The trick is building up a balanced array of per-group sample sizes (I've provided a somewhat clumsy way of doing that below), then cycling through the groupby object and sampling each group by referencing the balanced array.

import numpy as np
import pandas as pd

def rep_sample(df, col, n, *args, **kwargs):
    # number of distinct groups to spread the n samples over
    nu = df[col].nunique()

    # n // nu rows per group, plus one extra row for the first
    # (n % nu) groups so the sizes sum to exactly n
    mpb = n // nu
    mku = n - mpb * nu
    fills = np.zeros(nu)
    fills[:mku] = 1

    sample_sizes = (np.ones(nu) * mpb + fills).astype(int)

    gb = df.groupby(col)

    sample = lambda sub_df, i: sub_df.sample(sample_sizes[i], *args, **kwargs)

    subs = [sample(sub_df, i) for i, (_, sub_df) in enumerate(gb)]

    return pd.concat(subs)

Demonstration

rep_sample(df, 'category1', 5)

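A minimal setup to make the demonstration runnable, using rep_sample as defined above and the example frame from the question:

import pandas as pd

df = pd.DataFrame({
    'category1': list('aaaaaabbbbbb'),
    'category2': [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11],
})

# With n=5 and two groups, sample_sizes works out to [3, 2], so the
# result holds three random rows from 'a' and two from 'b'
print(rep_sample(df, 'category1', 5))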

Chappelka answered 12/9, 2016 at 23:39 Comment(0)

Here is a solution that does a true random sample stratified by group (it won't give you equal samples every time, but it does on average, which is probably better from a statistical perspective anyway):

import numpy as np
import pandas as pd

def stratified_sample(df, sample_size_per_class, strat_cols):

    if isinstance(strat_cols, str):
        strat_cols = [strat_cols]

    # work on a copy so the caller's frame isn't mutated
    df = df.copy()

    #make randomizer (separately, in case we need it later?)
    vcs = {}
    randomizer = {}
    for c in strat_cols:

        #calculate number of distinct classes relative to sample size
        _vc = df[c].value_counts()
        vcs[c] = (_vc / sample_size_per_class).round(0).astype(int).to_dict()

        #randomizer will divvy up the bigger groups into chunks of size
        #approximate to the smallest; c=c binds the loop variable now to
        #avoid the late-binding closure pitfall, and max(..., 1) keeps
        #randint from being handed an empty range for rare classes
        randomizer[c] = lambda v, c=c: np.random.randint(0, max(vcs[c][v], 1), size=None)

    #apply randomizer
    randomized_cols = []
    for c in strat_cols:
        stratname = '_stratified_%s' % c
        randomized_cols.append(stratname)
        df[stratname] = df[c].apply(randomizer[c])

    #return first random case, which should be approximately n_samples
    keep_cols = list(set(df.columns).difference(randomized_cols))
    return df[df[randomized_cols].max(axis=1) == 0][keep_cols]

To test it:

test = pd.DataFrame({'category1':[0,0,0,0,0,0,1,1,1,1,1,1],
                    'category2':[1,2,2,3,3,1,10,10,10,11,11,11]})

lens = []
for i in range(1000):
    lens.append(
        len(
            stratified_sample(test, 3, ['category1','category2'])
        )
    )

print(np.mean(lens))
Harlandharle answered 30/3, 2019 at 1:24 Comment(0)

This is straightforward when you use the weights keyword in df.sample:

>>> weights = (df['category2'].value_counts() / len(df)) ** -1
>>> df.sample(n=5, weights=df['category2'].map(weights))

output:

    category1   category2
2   "a"         2
1   "a"         2
10  "b"         11
3   "a"         3
11  "b"         11

To explain, the per-value weights look like this:

11    4.0
10    4.0
3     6.0
2     6.0
1     6.0

I just took the fraction of rows for each value in df['category2'] and then inverted those values, which makes for a nice uniform weight across all values in the series. The .map call then broadcasts those per-value weights back onto the rows, so they line up with the dataframe's index the way df.sample expects.
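
The same idea extends to balancing both columns at once; here is a sketch (assuming the example frame from the question) that weights each row by the inverse of its (category1, category2) group size:

import pandas as pd

df = pd.DataFrame({
    'category1': list('aaaaaabbbbbb'),
    'category2': [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11],
})

# Each row is weighted by 1 / (size of its (category1, category2) group),
# so every pair of values is equally likely to contribute rows
w = 1 / df.groupby(['category1', 'category2'])['category2'].transform('size')
print(df.sample(n=5, weights=w))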

Dineric answered 8/1, 2020 at 21:48 Comment(1)
This method only works if category2 is discrete; it won't work for a continuous column or one with a wide range of unique values.Calvincalvina
