pylab.hist(data, normed=1). Normalization seems to work incorrectly
S

7

47

I'm trying to create a histogram with the argument normed=1.

For instance:

import pylab

data = [1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1]
pylab.hist(data, normed=1)
pylab.show()

I expected that the sum of the bins would be 1. But instead, one of the bins is bigger than 1. What did this normalization do? And how do I create a histogram normalized so that its integral equals 1?


Saporous answered 31/3, 2011 at 9:51 Comment(2)
Also try pylab.hist(data, bins=5, range=(1, 6), normed=1). This will result in a bin width of 1. - Hom
"sum of the bins would be 1. But instead, one of the bins is bigger than 1" -> this is not a contradiction! - Ecbatana
R
66

See my other post for how to make the sum of all bins in a histogram equal to one: https://mcmap.net/q/209393/-plot-a-histogram-such-that-bar-heights-sum-to-1-probability

Copy & Paste:

import numpy as np
import matplotlib.pyplot as plt
weights = np.ones_like(myarray)/float(len(myarray))
plt.hist(myarray, weights=weights)

where myarray contains your data.
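A quick way to check what the weights trick does, as a sketch only: np.histogram takes the same weights argument as plt.hist, so the bar heights can be inspected without drawing anything. The sample data is the list from the question; the names heights and edges are mine.

```python
import numpy as np

# Sample data from the question; any 1-D sequence would do.
myarray = np.array([1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1])

# Each sample contributes 1/N instead of 1, so every bar is a fraction of the total.
weights = np.ones_like(myarray) / float(len(myarray))

# np.histogram accepts the same weights argument as plt.hist.
heights, edges = np.histogram(myarray, weights=weights)

print(heights)        # fraction of samples per bin
print(heights.sum())  # sums to 1 up to floating-point rounding
```

With this scaling no bar can exceed 1, whatever the bin width, because each height is a fraction of the sample count rather than a density.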

Reword answered 6/5, 2013 at 13:24 Comment(5)
This is the best way to do it if you're doing frequency histograms! - Microreader
FYI, make sure to keep normed=0 if you are using the above method. - Borderline
Worked perfectly in conjunction with the formatter in this example (which uses normed instead of weights; weights works regardless of bin size, whereas normed/density requires bins of size unity, from the documentation). - Encomium
amazing! Best way - Habana
great, practical solution - Idolism
G
24

According to the documentation for normed: If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. This is from the numpy doc, but it should be the same for pylab.

In []: data= array([1,1,2,3,3,3,3,3,4,5.1])
In []: counts, bins= histogram(data, normed= True)
In []: counts
Out[]: array([ 0.488,  0.,  0.244,  0.,  1.22,  0.,  0.,  0.244,  0.,  0.244])
In []: sum(counts* diff(bins))
Out[]: 0.99999999999999989

So the normalization is simply done as the documentation describes:

In []: counts, bins= histogram(data, normed= False)
In []: counts
Out[]: array([2, 0, 1, 0, 5, 0, 0, 1, 0, 1])
In []: counts_n= counts/ sum(counts* diff(bins))
In []: counts_n
Out[]: array([ 0.488,  0.,  0.244,  0.,  1.22 ,  0.,  0.,  0.244,  0.,  0.244])
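As far as I know, recent NumPy and Matplotlib releases no longer accept the normed keyword, and density=True is the replacement. A minimal sketch of the same check with explicit imports (the names heights, edges, widths and areas are mine):

```python
import numpy as np

data = np.array([1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1])

# density=True plays the role of the old normed=True.
heights, edges = np.histogram(data, density=True)

widths = np.diff(edges)      # 10 equal bins of width (5.1 - 1) / 10 = 0.41
areas = heights * widths     # probability mass that falls in each bin

print(heights.max())         # the tallest bar is a density and exceeds 1
print(areas.sum())           # the total area is 1
```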
Geodynamics answered 31/3, 2011 at 10:1 Comment(8)
Yep, I've read it already. The sum seems to be correct. But look at the histogram: the 3rd element is 1.215122. Why is it bigger than 1? - Saporous
@smirnoffs: What is your argument that it can't be higher than 1? Thanks - Geodynamics
@Geodynamics A normalized histogram, as I understand it, is a probability density function. A probability can't be more than 1. - Saporous
@smirnoffs: Can you provide some links to back up your definition of a normalized histogram? FWIW it's totally obvious from the docs how the normalization works. counts* diff(bins) gives you what you are looking for. Thanks - Geodynamics
Probability densities can be anything non-negative as long as the integral (not the sum) over the range is equal to 1. - Frants
@robert-kern You are probably right. It might be my misunderstanding. What exactly does the width of the bin mean in that case? - Saporous
The sum of the areas of the bins should be one. Each bin has a width less than 1/2 in this picture, so the area of the potentially offending bin is less than 0.5 * 1.215122 = 0.607561, which is fine. - Changeable
This answer refers to numpy.histogram, rather than pylab.hist. - Rockbound
B
9

I think you are confusing bin heights with bin contents. You need to add up the contents of each bin, i.e. height*width for all bins. That sum should equal 1.

Bayles answered 31/3, 2011 at 10:41 Comment(1)
So to clarify for all, what would you put as the y axis label on the OP's histogram? - Intertexture
R
8

What did this normalization do?

To normalize a histogram, you have to take the bin size into account. According to the documentation, the default number of bins is 10. Consequently, the bin size is (data.max() - data.min())/10, which is 0.41 here. With normed=1, the bar heights are scaled so that their sum, multiplied by 0.41, gives 1. This is what happens when you integrate.
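Worked through for the question's data (N = 10 samples in 10 equal bins), with n_i the count in bin i, h_i its bar height and w the common bin width (the symbols are mine, not from the documentation):

```latex
w = \frac{5.1 - 1}{10} = 0.41, \qquad
h_i = \frac{n_i}{N\,w}, \qquad
h_{\max} = \frac{5}{10 \times 0.41} \approx 1.22, \qquad
\sum_i h_i\, w = \frac{1}{N} \sum_i n_i = 1
```

The bin holding the five 3s is therefore taller than 1, while its area, 1.22 × 0.41 = 0.5, is exactly the fraction of samples it contains.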

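And how do I create a histogram normalized so that its integral equals 1?

I think that you want the sum of the histogram, not its integral, to be equal to 1. In this case the quickest way seems to be:

import numpy as np
import matplotlib.pyplot as plt
counts, edges = np.histogram(data)   # raw counts per bin, nothing drawn yet
norm = counts.sum()                  # total number of samples (not sum(data))
h2 = counts / norm                   # bar heights that sum to 1
plt.bar(edges[:-1], h2, width=np.diff(edges), align='edge')  # one bar per bin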
Rockbound answered 2/9, 2015 at 14:55 Comment(0)
M
5

I had the same problem, and while solving it another problem came up: how to plot the normalised bin frequencies as percentages with ticks on rounded values. I'm posting it here in case it is useful for anyone. In my example I chose 10% (0.1) as the maximum value for the y axis, and 10 steps (one from 0% to 1%, one from 1% to 2%, and so on). The trick is to set the ticks at the data counts (which are the output list n of plt.hist), which are then transformed into percentages using the FuncFormatter class. Here's what I did:

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

fig, ax = plt.subplots()

# The required parameters
num_steps = 10
max_percentage = 0.1
num_bins = 40

# Calculating the maximum value on the y axis and the yticks
max_val = max_percentage * len(data)
step_size = max_val / num_steps
yticks = [ x * step_size for x in range(0, num_steps+1) ]
ax.set_yticks( yticks )
plt.ylim(0, max_val)

# Running the histogram method
n, bins, patches = plt.hist(data, num_bins)

# To plot correct percentages in the y axis     
to_percentage = lambda y, pos: str(round( ( y / float(len(data)) ) * 100.0, 2)) + '%'
plt.gca().yaxis.set_major_formatter(FuncFormatter(to_percentage))

plt.show()

Plots

Before normalisation: the y axis unit is the number of samples within the bin intervals on the x axis.

After normalisation: the y axis unit is the frequency of the bin values as a percentage of all the samples.

Misinform answered 18/2, 2014 at 17:21 Comment(0)
S
4

There is also numpy.histogram. If you set density=True, the output will be normalized.

normed : bool, optional

This keyword is deprecated in Numpy 1.6 due to confusing/buggy behavior. It will be removed in Numpy 2.0. Use the density keyword instead. If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that this latter behavior is known to be buggy with unequal bin widths; use density instead.

density : bool, optional

If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. Overrides the normed keyword if given.
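To illustrate the note about unity width, here is a small sketch with deliberately unequal bins. The bin edges are hypothetical, chosen only for the demonstration, and the data is the list from the question.

```python
import numpy as np

data = np.array([1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1])

# Hypothetical, deliberately unequal bin edges: widths 1, 1 and 3.
edges = [1, 2, 3, 6]
heights, edges = np.histogram(data, bins=edges, density=True)

print(heights.sum())                     # not 1: the heights are densities
print((heights * np.diff(edges)).sum())  # 1: the area is what gets normalized
```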

Semantics answered 11/2, 2014 at 8:46 Comment(0)
E
2

Your expectations are wrong

The sum over all bins of height times width equals one. Or, as you said correctly, the integral has to be one, not the function you are integrating.

It's like this: probability (as in "the probability that the person is between 20 and 40 years old is ...%") is the integral ("from 20 to 40 years old") over the probability density. The bin's height shows the probability density, whereas width times height shows the probability for a certain point to be in this bin (you integrate the function, assumed constant at the height of the bin, from the beginning of the bin to its end). The height itself is a density and not a probability. It is a probability per unit width, which can of course be higher than one.

Simple example: imagine a probability density function on the interval from 0 to 1 that has the value 0 from 0 to 0.9. What could the function possibly be between 0.9 and 1? Try integrating over it: it has to be higher than 1 there.
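Spelled out, assuming for simplicity that the density is a constant c on [0.9, 1] (the constant shape is my simplification; the example leaves the shape open):

```latex
\int_0^1 f(x)\,dx = \int_{0.9}^{1} c\,dx = 0.1\,c = 1
\quad\Longrightarrow\quad c = 10
```

For any other shape the average height over [0.9, 1] still has to be 10, so the function must exceed 1 somewhere on that interval.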

Btw: from a rough guess, the sum of height times width of your hist seems to yield roughly 1, doesn't it?

Ecbatana answered 29/7, 2017 at 16:14 Comment(0)
