pylab.hist(data, normed=1). Normalization seems to work incorrectly

Asked 31/3, 2011 at 9:51 Answered 29/7, 2017 at 16:14

I'm trying to create a histogram with argument normed=1

For instance:

import pylab

data = ([1,1,2,3,3,3,3,3,4,5.1])    
pylab.hist(data, normed=1)
pylab.show()

I expected the sum of the bins to be 1. Instead, one of the bins is bigger than 1. What did this normalization do? And how can I create a histogram normalized so that the integral of the histogram equals 1?


Saporous answered 31/3, 2011 at 9:51 Comment(2)

Also try pylab.hist(data, bins=5, range=(1, 6), normed=1). This will result in a bin width of 1. – Hom 31/3, 2011 at 11:22

"sum of the bins would be 1. But instead, one of the bin is bigger then 1" -> this is not a contradiction! – Ecbatana 9/11, 2021 at 12:57

See my other post for how to make the sum of all bins in a histogram equal to one: https://mcmap.net/q/209393/-plot-a-histogram-such-that-bar-heights-sum-to-1-probability

Copy & Paste:

import numpy as np
import matplotlib.pyplot as plt
weights = np.ones_like(myarray)/float(len(myarray))
plt.hist(myarray, weights=weights)

where myarray contains your data
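A quick numerical check of the idea, as a sketch rather than anything official: it uses numpy.histogram (which takes the same weights argument) instead of plotting, and the sample values are just the data from the question.

```python
import numpy as np

# Sample data from the question (any 1-D array works)
myarray = np.array([1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1])

# Each sample contributes 1/N, so a bin's height is the fraction of samples in it
weights = np.ones_like(myarray) / float(len(myarray))

# np.histogram accepts the same `weights` argument as plt.hist
heights, edges = np.histogram(myarray, weights=weights)

print(heights)        # fraction of samples per bin
print(heights.sum())  # approximately 1, regardless of bin width
```

With this weighting no single bar can exceed 1, because each bar is a fraction of the samples rather than a density.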

Reword answered 6/5, 2013 at 13:24 Comment(5)

This is the best way to do it if you're doing frequency histograms! – Microreader 26/4, 2014 at 10:42

FYI, make sure to keep normed=0 if you are using the above method. – Borderline 24/1, 2015 at 13:57

Worked perfectly in conjunction with the formatter in this example (which uses normed instead of weights; weights works regardless of bin size, whereas normed/density requires bins of size unity, from the documentation). – Encomium 11/2, 2016 at 22:21

amazing! Best way – Habana 1/3, 2018 at 17:2

great, practical solution – Idolism 12/1, 2023 at 19:20

According to the documentation for normed: If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. This is from the numpy doc, but it should be the same for pylab.

In []: data= array([1,1,2,3,3,3,3,3,4,5.1])
In []: counts, bins= histogram(data, normed= True)
In []: counts
Out[]: array([ 0.488,  0.,  0.244,  0.,  1.22,  0.,  0.,  0.244,  0.,  0.244])
In []: sum(counts* diff(bins))
Out[]: 0.99999999999999989

So the normalization is simply done as the documentation describes:

In []: counts, bins= histogram(data, normed= False)
In []: counts
Out[]: array([2, 0, 1, 0, 5, 0, 0, 1, 0, 1])
In []: counts_n= counts/ sum(counts* diff(bins))
In []: counts_n
Out[]: array([ 0.488,  0.,  0.244,  0.,  1.22 ,  0.,  0.,  0.244,  0.,  0.244])
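As far as I know, recent NumPy releases no longer accept normed, and density=True is the replacement. A minimal sketch of the same check, assuming a current NumPy and the data from the question:

```python
import numpy as np

data = np.array([1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1])

# density=True takes the place of the old normed=True
density, edges = np.histogram(data, density=True)
widths = np.diff(edges)

# Same values by hand: raw count divided by (number of samples * bin width)
counts, _ = np.histogram(data)
by_hand = counts / (counts.sum() * widths)

print(density.max())             # a height above 1 is fine for a density
print((density * widths).sum())  # the area under the bars, approximately 1
```

The per-bin division by the width is what lets a bar rise above 1 while the total area stays at 1.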

Geodynamics answered 31/3, 2011 at 10:1 Comment(8)

Yep, I've read it already. The sum seems to be correct. But look at the histogram, the 3rd element is 1.215122. Why is it bigger than 1? – Saporous 31/3, 2011 at 10:6

@smirnoffs: What is your argument that it can't be higher than 1? Thanks – Geodynamics 31/3, 2011 at 10:18

@Geodynamics Normalized histogram, as I understood it, is a probability density function. Probability can't be more than 1. – Saporous 31/3, 2011 at 12:32

@smirnoffs: can you provide some links to backup your definition of normalized histogram? FWIW it's totally obvious from the docs how the normalization works. counts* diff(bins) gives you what you are looking for. Thanks – Geodynamics 31/3, 2011 at 13:35

Probability densities can be anything non-negative as long as the integral (not the sum) over the range is equal to 1. – Frants 31/3, 2011 at 15:50

@robert-kern You are probably right. Might be it's my misunderstanding. What exactly the width of the bin means in that case? – Saporous 1/4, 2011 at 6:43

The sum of the areas of the bins should be one. Each bin has a width less than 1/2 in this picture, so the area of the potentially offending bin is less than .5 * 1.215122 = .607561 of area which is fine. – Changeable 28/9, 2011 at 20:14

This answer refers to numpy.histogram, rather than pylab.hist – Rockbound 2/9, 2015 at 10:59

I think you are confusing bin heights with bin contents. You need to add the contents of each bin, i.e. height*width for all bins. That should = 1.

Bayles answered 31/3, 2011 at 10:41 Comment(1)

So to clarify for all, what would you put as the y axis label on the OP's histogram? – Intertexture 28/6, 2019 at 1:3

What this normalization did?

In order to normalize a histogram, you have to take the bin size into account. According to the documentation, the default number of bins is 10. Consequently, the bin size is (data.max() - data.min())/10, that is 0.41. If normed=1, the heights of the bars are such that their sum, multiplied by 0.41, gives 1. This is what happens when you integrate.
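The arithmetic can be reproduced without any plotting library. This is only a plain-Python sketch with a hand-rolled binning loop (my own, not what hist does internally), using the data from the question:

```python
data = [1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1]
num_bins = 10  # the default number of bins

lo, hi = min(data), max(data)
bin_width = (hi - lo) / num_bins  # (5.1 - 1) / 10, about 0.41

# Count samples per bin; the maximum value is placed in the last bin
counts = [0] * num_bins
for x in data:
    index = min(int((x - lo) / bin_width), num_bins - 1)
    counts[index] += 1

# Density height = count / (number of samples * bin width)
heights = [c / (len(data) * bin_width) for c in counts]

print(counts)
print(max(heights))              # about 1.22, larger than 1
print(sum(heights) * bin_width)  # approximately 1
```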

And how to create a histogram with such normalization that the integral of the histogram would be equal 1?

I think that you want the sum of the histogram, not its integral, to be equal to 1. In this case the quickest way seems to be:

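h = plt.hist(data)
norm = sum(h[0])  # total number of samples (sum of the bin counts), not sum(data)
h2 = [i/norm for i in h[0]]
plt.figure()  # draw the normalized bars on a fresh figure
# h[1] holds the bin edges (one more than the bars), so use the left edges
plt.bar(h[1][:-1], h2, width=h[1][1]-h[1][0], align='edge')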

Rockbound answered 2/9, 2015 at 14:55 Comment(0)

I had the same problem, and while solving it another problem came up: how to plot the normalised bin frequencies as percentages with ticks on rounded values. I'm posting it here in case it is useful for anyone. In my example I chose 10% (0.1) as the maximum value for the y axis, and 10 steps (one from 0% to 1%, one from 1% to 2%, and so on). The trick is to set the ticks at the data counts (which are the output list n of plt.hist), which are then transformed into percentages using the FuncFormatter class. Here's what I did:

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

fig, ax = plt.subplots()

# The required parameters
num_steps = 10
max_percentage = 0.1
num_bins = 40

# Calculating the maximum value on the y axis and the yticks
max_val = max_percentage * len(data)
step_size = max_val / num_steps
yticks = [ x * step_size for x in range(0, num_steps+1) ]
ax.set_yticks( yticks )
plt.ylim(0, max_val)

# Running the histogram method
n, bins, patches = plt.hist(data, num_bins)

# To plot correct percentages in the y axis     
to_percentage = lambda y, pos: str(round( ( y / float(len(data)) ) * 100.0, 2)) + '%'
plt.gca().yaxis.set_major_formatter(FuncFormatter(to_percentage))

plt.show()

Plots

Before normalisation: the y axis unit is number of samples within the bin intervals in the x axis:

After normalisation: the y axis unit is frequency of the bin values as a percentage over all the samples

Misinform answered 18/2, 2014 at 17:21 Comment(0)

There is also numpy.histogram. If you set density=True, the output will be normalized.

normed : bool, optional

This keyword is deprecated in Numpy 1.6 due to confusing/buggy behavior. It will be removed in Numpy 2.0. Use the density keyword instead. If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that this latter behavior is known to be buggy with unequal bin widths; use density instead.

density : bool, optional

If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. Overrides the normed keyword if given.
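To see the "sum is not 1 unless bins have unit width" note in action, here is a small sketch with unequal bins. The bin edges are made up for illustration and the data is the one from the question:

```python
import numpy as np

data = np.array([1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1])

# Hypothetical unequal bin edges: two bins of width 1, one of width 3
edges = [1, 2, 3, 6]
density, edges = np.histogram(data, bins=edges, density=True)
widths = np.diff(edges)

print(density)                   # heights per bin
print(density.sum())             # not 1 in general
print((density * widths).sum())  # the integral, approximately 1
```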

Semantics answered 11/2, 2014 at 8:46 Comment(0)

Your expectations are wrong

The sum of each bin's height times its width equals one. Or, as you said correctly, the integral has to be one, not the function you are integrating.

It's like this: probability (as in "the probability that the person is between 20 and 40 years old is ...%") is the integral ("from 20 to 40 years old") over the probability density. The bin's height shows the probability density, whereas the width times the height shows the probability for a certain point to be in this bin (you integrate the function, assumed constant at the height of the bin, from the beginning of the bin to its end). The height itself is the density and not a probability. It is a probability per width, which can of course be higher than one.

Simple example: imagine a probability density function on the interval from 0 to 1 that has value 0 from 0 to 0.9. What could the function possibly be between 0.9 and 1? Try integrating it: for the integral to come out as 1, the function has to be higher than 1 there.
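Worked out in numbers, as a sketch that assumes the density is constant on the piece from 0.9 to 1 (the simplest possible case):

```python
# Piecewise-constant density on [0, 1]: zero on [0, 0.9), constant on [0.9, 1]
edges = [0.0, 0.9, 1.0]
widths = [right - left for left, right in zip(edges[:-1], edges[1:])]

# The whole unit of probability has to sit in the second piece,
# so its height must be 1 / (width of that piece)
heights = [0.0, 1.0 / widths[1]]

area = sum(h * w for h, w in zip(heights, widths))

print(heights[1])  # about 10, far above 1
print(area)        # approximately 1
```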

Btw: from a rough guess, the sum of height times width of your hist seems to yield roughly 1, doesn't it?

Ecbatana answered 29/7, 2017 at 16:14 Comment(0)
