Creating a matplotlib or seaborn histogram which uses percent rather than count?
Asked Answered
M

5

9

Specifically I'm dealing with the Kaggle Titanic dataset. I've plotted a stacked histogram which shows ages that survived and died upon the titanic. Code below.

figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Age'], data[data['Survived']==0]['Age']], stacked=True, bins=30, label=['Survived','Dead'])
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.legend()

I would like to alter the chart to show a single chart per bin of the percentage in that age group that survived. E.g. if a bin contained the ages between 10-20 years of age and 60% of people aboard the titanic in that age group survived, then the height would line up 60% along the y-axis.

Edit: I may have given a poor explanation to what I'm looking for. Rather than alter the y-axis values, I'm looking to change the actual shape of the bars based on the percentage that survived.

The first bin on the graph shows roughly 65% survived in that age group. I would like this bin to line up against the y-axis at 65%. The following bins look to be 90%, 50%, 10% respectively, and so on.

The graph would end up actually looking something like this:

enter image description here

Macy answered 17/10, 2016 at 17:26 Comment(1)
The library Dexplot is able to create a stacked bar plot of percentages. See my answer below.Aftergrowth
A
2

Perhaps the following will help ...

  1. Split the dataframe based on 'Survived'

    df_survived=df[df['Survived']==1]
    df_not_survive=df[df['Survived']==0]
    
  2. Create Bins

    age_bins=np.linspace(0,80,21)
    
  3. Use np.histogram to generate histogram data

    survived_hist=np.histogram(df_survived['Age'],bins=age_bins,range=(0,80))
    not_survive_hist=np.histogram(df_not_survive['Age'],bins=age_bins,range=(0,80))
    
  4. Calculate survival rate in each bin

    surv_rates=survived_hist[0]/(survived_hist[0]+not_survive_hist[0])
    
  5. Plot

    plt.bar(age_bins[:-1],surv_rates,width=age_bins[1]-age_bins[0])
    plt.xlabel('Age')
    plt.ylabel('Survival Rate')
    

enter image description here

Anaptyxis answered 7/2, 2017 at 4:32 Comment(1)
to avoid all floating values rounded to 0. Use the following from __future__ import divisionMillimeter
K
5

For Seaborn, use the parameter stat. According to the documentation, currently supported values for the stat parameter are:

  • count shows the number of observations
  • frequency shows the number of observations divided by the bin width
  • density normalizes counts so that the area of the histogram is 1
  • probability normalizes counts so that the sum of the bar heights is 1
  • percent normalize such that bar heights sum to 100

The result with stat being count:

seaborn.histplot(
    data=data,
    x='variable',
    discrete=True,
    stat='count'
)

Histogram result for stat=count

The result after stat is changed to probability:

seaborn.histplot(
    data=data,
    x='variable',
    discrete=True,
    stat='probability'
)

Histogram result for stat=probability

Knesset answered 12/1, 2021 at 11:34 Comment(1)
What if I want the y-axis to be represented as a percentage, i.e. from 0-100 instead of 0-1? Edit: figured it out. Set stat = 'percent'Rosmunda
B
2

pd.Series.hist uses np.histogram underneath.

Let's explore that

np.random.seed([3,1415])
s = pd.Series(np.random.randn(100))
d = np.histogram(s, normed=True)
print('\nthese are the normalized counts\n')
print(d[0])
print('\nthese are the bin values, or average of the bin edges\n')
print(d[1])

these are the normalized counts

[ 0.11552497  0.18483996  0.06931498  0.32346993  0.39278491  0.36967992
  0.32346993  0.25415494  0.25415494  0.02310499]

these are the bin edges

[-2.25905503 -1.82624818 -1.39344133 -0.96063448 -0.52782764 -0.09502079
  0.33778606  0.77059291  1.20339976  1.6362066   2.06901345]

We can plot these while calculating the mean bin edges

pd.Series(d[0], pd.Series(d[1]).rolling(2).mean().dropna().round(2).values).plot.bar()

enter image description here

ACTUAL ANSWER
OR

We could have simply passed normed=True to the pd.Series.hist method. Which passes it along to np.histogram

s.hist(normed=True)

enter image description here

Britneybritni answered 17/10, 2016 at 17:35 Comment(1)
Great post. Not quite what I was looking for though. I've updated my post.Macy
A
2

Perhaps the following will help ...

  1. Split the dataframe based on 'Survived'

    df_survived=df[df['Survived']==1]
    df_not_survive=df[df['Survived']==0]
    
  2. Create Bins

    age_bins=np.linspace(0,80,21)
    
  3. Use np.histogram to generate histogram data

    survived_hist=np.histogram(df_survived['Age'],bins=age_bins,range=(0,80))
    not_survive_hist=np.histogram(df_not_survive['Age'],bins=age_bins,range=(0,80))
    
  4. Calculate survival rate in each bin

    surv_rates=survived_hist[0]/(survived_hist[0]+not_survive_hist[0])
    
  5. Plot

    plt.bar(age_bins[:-1],surv_rates,width=age_bins[1]-age_bins[0])
    plt.xlabel('Age')
    plt.ylabel('Survival Rate')
    

enter image description here

Anaptyxis answered 7/2, 2017 at 4:32 Comment(1)
to avoid all floating values rounded to 0. Use the following from __future__ import divisionMillimeter
A
1

The library Dexplot is capable of returning relative frequencies of groups. Currently, you'll need to bin the age variable in pandas with the cut function. You can then, use Dexplot.

titanic['age2'] = pd.cut(titanic['age'], range(0, 110, 10))

Pass the variable you would like to count (age2) to the count function. Subdivide the counts with the split parameter and normalize by age2. Also, this might be a good time for a stacked bar plot

dxp.count('age2', data=titanic, split='survived', stacked=True, normalize='age2')

enter image description here

Aftergrowth answered 7/10, 2018 at 20:41 Comment(0)
G
0

First of all it would be better if you create a function that splits your data in age groups

# This function splits our data frame in predifined age groups
def cutDF(df):
    return pd.cut(
        df,[0, 10, 20, 30, 40, 50, 60, 70, 80], 
        labels=['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80'])


data['AgeGroup'] = data[['Age']].apply(cutDF)

Then you can plot your graph as follows:

survival_per_age_group = data.groupby('AgeGroup')['Survived'].mean()

# Creating the plot that will show survival % per age group and gender
ax = survival_per_age_group.plot(kind='bar', color='green')
ax.set_title("Survivors by Age Group", fontsize=14, fontweight='bold')
ax.set_xlabel("Age Groups")
ax.set_ylabel("Percentage")
ax.tick_params(axis='x', top='off')
ax.tick_params(axis='y', right='off')
plt.xticks(rotation='horizontal')             

# Importing the relevant fuction to format the y axis 
from matplotlib.ticker import FuncFormatter

ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
plt.show()
Glasgow answered 18/10, 2016 at 6:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.