Can't get y-axis on Matplotlib histogram to display probabilities

4

5

I have data (a pandas Series of daily stock returns, n = 555) that looks like this:

S = perf_manual.returns
S = S[~((S-S.mean()).abs()>3*S.std())]

2014-03-31 20:00:00    0.000000
2014-04-01 20:00:00    0.000000
2014-04-03 20:00:00   -0.001950
2014-04-04 20:00:00   -0.000538
2014-04-07 20:00:00    0.000764
2014-04-08 20:00:00    0.000803
2014-04-09 20:00:00    0.001961
2014-04-10 20:00:00    0.040530
2014-04-11 20:00:00   -0.032319
2014-04-14 20:00:00   -0.008512
2014-04-15 20:00:00   -0.034109
...

I'd like to generate a probability distribution plot from this. Using:

print stats.normaltest(S)

n, bins, patches = plt.hist(S, 100, normed=1, facecolor='blue', alpha=0.75)
print np.sum(n * np.diff(bins))

(mu, sigma) = stats.norm.fit(S)
print mu, sigma
y = mlab.normpdf(bins, mu, sigma)
plt.grid(True)
l = plt.plot(bins, y, 'r', linewidth=2)

plt.xlim(-0.05,0.05)
plt.show()

I get the following:

NormaltestResult(statistic=66.587382579416982, pvalue=3.473230376732532e-15)
1.0
0.000495624926242 0.0118790391467

graph

I have the impression the y-axis is a count, but I'd like to have probabilities instead. How do I do that? I've tried a whole lot of StackOverflow answers and can't figure this out.

Accipitrine answered 29/7, 2016 at 4:31 Comment(6)
Are you sure that these are counts? I guess they are probability density values as your graph is normalized to 1 when you integrate over it. The range of your x-values is just very small. – Limicolous
Could be, probability densities are not my strongest point. How can I at least make these into percentages? – Heyes
What percentages do you want to have? For each bin the probability of data being in this bin? Probability density basically means that the integral over the density for some x-range gives you the probability of that range. – Limicolous
Yup probability of data being in bin. – Heyes
Have you looked at seaborn? Several built-in compound plots that might include what you're looking for (once you figure out the data meaning). – Iselaisenberg
@Accipitrine You might want to clarify your question then. Because you say that you want the probability distribution, which is exactly what you did yourself. But apparently you want the probability of a point being in a bin instead. That is something different, though! – Limicolous
L
10

There is no easy way (that I know of) to do that using plt.hist. But you can simply bin the data using np.histogram and then normalize it any way you want. If I understood you correctly, you want to display the probability of finding a point in a given bin, NOT the probability density. That means you have to scale your data so that the sum over all bins is 1. That can be done with bin_probability = n/float(n.sum()).

You will then no longer have a properly normalized probability density function (pdf), meaning that the integral over an interval will not be a probability! That is why you have to rescale your mlab.normpdf curve to have the same norm as your histogram. The factor needed is just the bin width: for the properly normalized binned pdf, the sum over all bins of height times width is 1, whereas now you want the sum of the bin heights alone to equal 1. So the scaling factor is the bin width.

Therefore, the code you end up with is something along the lines of:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Produce test data
S = np.random.normal(0, 0.01, size=1000)

# Histogram:
# Bin it
n, bin_edges = np.histogram(S, 100)
# Normalize it, so that every bin's value gives the probability of that bin
bin_probability = n/float(n.sum())
# Get the mid points of every bin
bin_middles = (bin_edges[1:]+bin_edges[:-1])/2.
# Compute the bin-width
bin_width = bin_edges[1]-bin_edges[0]
# Plot the histogram as a bar plot
plt.bar(bin_middles, bin_probability, width=bin_width)

# Fit to normal distribution
(mu, sigma) = stats.norm.fit(S)
# The pdf should not be normed anymore but scaled the same way as the data
# (stats.norm.pdf replaces mlab.normpdf, which was removed in Matplotlib 3.1)
y = stats.norm.pdf(bin_middles, mu, sigma)*bin_width
l = plt.plot(bin_middles, y, 'r', linewidth=2)

plt.grid(True)
plt.xlim(-0.05,0.05)
plt.show()

And the resulting picture will be:

[image: bar histogram of the per-bin probabilities with the rescaled normal fit drawn in red]
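
The bin-width argument can also be checked numerically. The following is only a minimal sketch on synthetic data (the seed, sample size, and variable names are assumptions, not the asker's series): it compares density-normalized heights from np.histogram with the per-bin probabilities.

```python
import numpy as np

# Synthetic stand-in for the returns series (an assumption, not the asker's data)
rng = np.random.default_rng(0)
S = rng.normal(0.0, 0.01, size=1000)

# Density-normalized heights: these integrate to 1 over the x-range
density, bin_edges = np.histogram(S, bins=100, density=True)
bin_widths = np.diff(bin_edges)

# Per-bin probabilities: raw counts divided by the number of points
counts, _ = np.histogram(S, bins=bin_edges)
bin_probability = counts / counts.sum()

# Multiplying a density by its bin width recovers the per-bin probability
print(np.allclose(density * bin_widths, bin_probability))
# The per-bin probabilities add up to 1
print(np.isclose(bin_probability.sum(), 1.0))
# With bins this narrow, density values can be far larger than 1
print(density.max() > 1.0)
```

This is also why a density axis can look like a count axis: when the x-range is tiny, the heights have to be large for the area to equal 1.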

Limicolous answered 29/7, 2016 at 11:26 Comment(1)
Thanks for this and dispelling my confusion :) – Heyes
M
6

jotasi's answer works, of course, but I'd like to add a very simple trick for achieving this by directly calling hist.

The trick is to use the weights parameter. By default, every data point you pass has a weight of 1. The height of each bin is then the sum of the weights of the data points that fall into that bin. Instead, if we have n points, we can simply make the weight of each point be 1 / n. Then, the sum of the weights of the points that fall into a certain bucket is also the probability that a given point is in that bucket.

In your case, just change the plot line to:

n, bins, patches = plt.hist(S, weights=np.ones_like(S) / len(S),
                            facecolor='blue', alpha=0.75)
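
To see that the weighted heights really are per-bin probabilities, here is a small sketch using np.histogram, which plt.hist uses for the binning itself. The data is synthetic and the bin count of 100 is taken from the question; both are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in for the returns series (an assumption, not the asker's data)
rng = np.random.default_rng(0)
S = rng.normal(0.0, 0.01, size=555)

# Each point carries weight 1/n, so a bin's height is the fraction of points in it
weights = np.ones_like(S) / len(S)

# plt.hist(S, bins=100, weights=weights) would draw bars with these heights
heights, bin_edges = np.histogram(S, bins=100, weights=weights)
counts, _ = np.histogram(S, bins=bin_edges)

# The heights sum to 1 and match count / n in every bin
print(np.isclose(heights.sum(), 1.0))
print(np.allclose(heights, counts / len(S)))
```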
Munoz answered 23/8, 2018 at 16:15 Comment(0)
B
0

Gabriel's answer initially didn't work for me, because I was also passing the density=True parameter. Although this isn't explicitly mentioned anywhere, with that parameter set matplotlib seems to ignore your weight values, and it doesn't raise any error either.
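
A likely explanation, sketched here with np.histogram (which plt.hist relies on for binning): density=True rescales the weighted heights so that they integrate to 1, and that rescaling cancels any constant weight such as 1 / n. The data and bin count below are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.default_rng(0)
S = rng.normal(0.0, 0.01, size=555)
weights = np.ones_like(S) / len(S)

# Density without weights
plain_density, edges = np.histogram(S, bins=50, density=True)
# Density with uniform weights on the same bin edges
weighted_density, _ = np.histogram(S, bins=edges, weights=weights, density=True)

# The constant factor 1/n is divided out again by the density normalization
print(np.allclose(plain_density, weighted_density))
```

So the weights are not so much ignored as normalized away; with non-uniform weights the two results would generally differ.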

Bedroom answered 18/12, 2023 at 12:39 Comment(0)
B
0

The matplotlib plt.hist documentation itself gives a hint for a simpler version of this code.

counts, bins = np.histogram(data)
weights = counts/np.sum(counts)
plt.hist(bins[:-1], bins, weights=weights)
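
Why this works can be checked without plotting. In the sketch below (synthetic data, an assumption), the left edge of each bin is fed back in as a single point carrying that bin's probability as its weight, so re-binning on the same edges returns the probabilities unchanged.

```python
import numpy as np

# Synthetic stand-in for `data` (an assumption for illustration)
rng = np.random.default_rng(0)
data = rng.normal(0.0, 0.01, size=555)

counts, bins = np.histogram(data)
weights = counts / np.sum(counts)

# One weighted point per bin, placed on the bin's left edge
heights, _ = np.histogram(bins[:-1], bins, weights=weights)

# The re-binned heights equal the per-bin probabilities
print(np.allclose(heights, weights))
print(np.isclose(heights.sum(), 1.0))
```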
Bedroom answered 18/12, 2023 at 14:19 Comment(0)
