Get percentages of a column based off of another column but with different categories
Asked Answered
S

3

5

I have the following Pandas Series:

                    Count
Pclass  Survived    
1       0            80
        1           136
2       0            97
        1            87
3       0           372
        1           119

But I want something like this:

                    Count   Percentage
Pclass  Survived       
1       0           80      37.0
        1           136     63.0
2       0           97      52.7
        1           87      47.3
3       0           372     75.8
        1           119     24.2

I want to calculate the percentages based on only the counts in each category of Pclass (not the whole sum of counts). It would be great if these percentages are calculated using only the Count column.

So far, what I did was I summed up the counts according to Pclass and used .repeat() to duplicate the values and tried to concatenate it to the original Series, which failed spectacularly.

        Count
Pclass  
1       216
1       216
2       184
2       184
3       491
3       491

My idea was to use this repeat column as the denominators in calculating the percentages, like this:

80 / 216 * 100 = 37.0%

and then remove the repeat column after the percentages are calculated. Seems so simple but I can't seem to make it work. Any help is appreciated.

Snotty answered 27/9, 2018 at 5:56 Comment(0)
G
6

div and sum with level=0

df.assign(Pct=df.div(df.sum(level=0), level=0).round(2) * 100)

                 Count   Pct
Pclass Survived             
1      0            80  37.0
       1           136  63.0
2      0            97  53.0
       1            87  47.0
3      0           372  76.0
       1           119  24.0
Goldagoldarina answered 27/9, 2018 at 6:47 Comment(1)
This is exactly what I was looking for.Snotty
S
5

Use GroupBy.transform for return Series with same size as original DataFrame, divide by div, multiple by mul and if necessary round:

#if input is Series create one column DataFrame
df = s.to_frame('Count')

s = df.groupby('Pclass')['Count'].transform('sum')
#if use oldier pandas version
#s = df.groupby(level='Pclass')['Count'].transform('sum')

df['Percentage'] = df['Count'].div(s).mul(100).round()
print (df)
                 Count  Percentage
Pclass Survived                   
1      0            80        37.0
       1           136        63.0
2      0            97        53.0
       1            87        47.0
3      0           372        76.0
       1           119        24.0

EDIT:

dont you have to forward fill on PClass ?

It is not necessary, because default in MultiIndex are not shown, but if set multi_sparse to False is possible verify repeating values:

with pd.option_context('display.multi_sparse', False):
    print (df)
                 Count
Pclass Survived       
1      0            80
1      1           136
2      0            97
2      1            87
3      0           372
3      1           119
Stigmatize answered 27/9, 2018 at 5:58 Comment(2)
Hi @jezrael, dont you have to forward fill on PClass?Oquinn
@VishnuKunchur - No, because MultiIndexStigmatize
B
0

You are working on titanic dataset. You can also do like below on the original data (not on your Series above):

df.Survived.groupby(df.Pclass).value_counts(normalize=True)*100

It will give you count of each class in percentage rather than in actual number.

Bolin answered 27/9, 2018 at 6:7 Comment(2)
Thank you, this also works. But how would you make it into the format that I requested above, with the two columns Count and Percentage?Snotty
Its just another way to get percentage, not in the exact format as you desired. Answer from @Stigmatize above is more than great.Bolin

© 2022 - 2024 — McMap. All rights reserved.