matplotlib hist function argument density not working

5

20

plt.hist's density argument does not work.

I tried to use the density argument in the plt.hist function to normalize stock returns in my plot, but it didn't work.

The following code worked fine for me and gave me the probability density function I wanted.

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(19680801)

# example data
mu = 100  # mean of distribution
sigma = 15  # standard deviation of distribution
x = mu + sigma * np.random.randn(437)

num_bins = 50

plt.hist(x, num_bins, density=1)

plt.show()

plot shows density

But when I tried it with stock data, it simply didn't work. The result gave the unnormalized data. I didn't find any abnormal data in my data array.

import numpy as np
import matplotlib.pyplot as plt
# "returns" is a NumPy array holding 360 days of stock returns
fig = plt.figure()
plt.hist(returns, 50, density=True)
plt.show()

density not working

Verism answered 7/4, 2019 at 3:52 Comment(9)
What does your actual data look like? - Laurentia
Something like this: array([ 1.88179947e-02, -4.67532468e-03, 9.85850151e-03, 3.38807856e-03, 6.23819607e-03, 1.37640769e-02, -2.24416517e-03, -2.83400810e-02, -4.09722222e-02, -2.89645185e-03, -1.39191479e-02, 4.35743218e-03, 3.48304308e-03, -1.15698453e-02, 1.81123706e-02, 2.32361128e-02, 4.41750444e-02, 1.81231240e-03, 3.92334219e-02, 7.23494533e-03, 4.80665370e-03, 7.04111798e-03, 1.43040137e-02, -7.62997264e-03]) - Verism
I tried converting the data type to float, but the result is still the same. - Verism
What else do you expect the second graph to look like? - Gotama
Both plots are correct in the sense that they are both normalized (the areas of the bars sum to 1). Perhaps you just have a different idea of what the density should look like? In that case, this problem can only be solved if you tell people what that would be. - Oxcart
@Oxcart I assume that he expects to see the probability value for each bar on the vertical axis. In the bottom picture, the values range from 0 to 40; I suspect he expects them to vary between 0 and 1. - Congeal
I'm having the same problem: I'm expecting the values to vary between 0 and 1. Can someone explain in an answer what the limits shown on the Matplotlib graph are? - Gearalt
Running into the same problem. The y-axis label should be the density of each bar. - Indochina
Does this answer your question? "pylab.hist(data, normed=1). Normalization seems to work incorrect" - Metronome
9

This is a known issue in Matplotlib.

As stated in the bug report "The density flag in pyplot.hist() does not work correctly":

When density = False, the histogram plot would have counts on the Y-axis. But when density = True, the Y-axis does not mean anything useful. I think a better implementation would plot the PDF as the histogram when density = True.

The developers view this as a feature, not a bug, since it maintains compatibility with NumPy. They have already closed several bug reports about it because it is working as intended. Adding to the confusion, the example on the Matplotlib site appears to show this feature working, with the y-axis being assigned a meaningful value.

What you want to do with Matplotlib is reasonable, but Matplotlib will not let you do it that way.

Indochina answered 28/8, 2020 at 2:58 Comment(0)
2

It is not a bug. The areas of the bars sum to 1. The numbers only seem strange because your bin widths are small.
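
A quick numerical check of this, as a minimal sketch. It assumes synthetic data: the `returns` array below is made up (normal noise with a standard deviation of 0.02, roughly the scale of the values posted in the comments), and it uses `np.histogram` for the binning rather than the plot itself.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 360 daily returns in the question
returns = rng.normal(loc=0.0, scale=0.02, size=360)

density, edges = np.histogram(returns, bins=50, density=True)
widths = np.diff(edges)

print(widths[0])                 # bin width, far below 1
print(density.max())             # bar heights can be well above 1
print(np.sum(density * widths))  # total area of the bars, 1 up to rounding
```

With bins this narrow, heights in the tens are what a total area of 1 requires.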

Hilleary answered 26/10, 2020 at 0:23 Comment(0)
1

Since this isn't resolved: @user14518925's response is actually correct. The density calculation treats the bin width as a real quantity, whereas from my understanding you want each bin to count as having a width of 1, so that the bar heights themselves sum to 1. More succinctly, what you're seeing is:

$$\sum_{i} y_{i} \times \text{bin size} = 1$$

Whereas what you want is:

$$\sum_{i} y_{i} = 1$$

Therefore, all you really need to change are the tick labels on the y-axis. One way to do this is to disable the density option:

density=False

and instead divide the tick values by the total sample size, like this (adapted from your example):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(19680801)

# example data
mu = 0  # mean of distribution
sigma = 0.0000625  # standard deviation of distribution
x = mu + sigma * np.random.randn(437)

fig = plt.figure()
plt.hist(x, 50, density=False)  # plot raw counts
locs, _ = plt.yticks()  # current tick positions, in counts
print(locs)
# relabel the ticks as count / sample size, i.e. the fraction of samples
plt.yticks(locs, np.round(locs / len(x), 3))
plt.show()
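
An alternative that avoids relabeling the ticks is to pass per-sample weights to `plt.hist`, so the bar heights themselves are fractions of the sample. This is a sketch under the same example data as above, not part of the original answer; the axis label text is my own choice.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(19680801)
x = 0.0000625 * np.random.randn(437)

# Each sample contributes 1/N, so every bar height is a fraction of the sample
weights = np.ones_like(x) / len(x)
heights, edges, _ = plt.hist(x, bins=50, weights=weights)

plt.ylabel("fraction of samples per bin")
plt.show()

print(heights.sum())  # the bar heights add up to 1, up to rounding
```

Because the weights rescale the counts before plotting, the y-axis needs no post-processing and the heights stay below 1 regardless of bin width.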
Ralfston answered 26/3, 2022 at 0:17 Comment(0)
1

At first I also thought this was an issue. I thought that the tick values shown on the y-axis should not be greater than 1, since that would mean the frequency in a bin is greater than the total frequency, which simply doesn't make sense.

After thinking for a while, I understood what's really happening. What we are expecting it to return is the probability of each bin, which is nothing but (observed frequency of a bin) / (total frequency).

But what Matplotlib returns as density is (observed frequency of a bin) / (total frequency * width of the bin). If the width of each bin is much less than 1, the density for that bin can exceed 1. The total area under the histogram still remains 1, because sum(density * bin_width) over all bins = sum(each frequency) / (total frequency) = 1.
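
The formula can be checked directly against `np.histogram`. This is a minimal sketch on made-up data (normal noise with a standard deviation of 0.02); the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=0.02, size=1000)  # hypothetical sample

counts, edges = np.histogram(data, bins=40)
density, _ = np.histogram(data, bins=40, density=True)
widths = np.diff(edges)

# density = count / (total count * bin width)
manual_density = counts / (counts.sum() * widths)
print(np.allclose(density, manual_density))

# Multiplying by the bin width recovers the per-bin probability
probability = density * widths
print(probability.sum())
```

The `probability` array here is the quantity most people in this thread expected on the y-axis: it never exceeds 1 and sums to 1.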

So the values you are getting are absolutely fine and make sense too.

Donnadonnamarie answered 29/11, 2023 at 19:34 Comment(0)
0

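Another approach, besides that of tvbc, is to change the yticks on the plot. With equal-width bins, multiplying each density tick value by the bin width (steps here) gives the fraction of samples per bin.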

import matplotlib.pyplot as plt
import numpy as np

steps = 10
bins = np.arange(0, 101, steps)
data = np.random.random(100000) * 100

plt.hist(data, bins=bins, density=True)
yticks = plt.gca().get_yticks()
plt.yticks(yticks, np.round(yticks * steps, 2))
plt.show()
Exanthema answered 14/12, 2022 at 21:14 Comment(0)
