How to plot a histogram using Matplotlib in Python with a list of data?

If I have a list of y-values that correspond to bar height and a list of x-value strings, how do I plot a histogram using matplotlib.pyplot.hist?

Related: matplotlib.pyplot.bar.

Lely answered 18/10, 2015 at 21:46 Comment(1)
If you plotted a histogram using .bar but it doesn’t look correct, then probably the bars are too wide. See this answer to adjust the bar width.Frons
T
295

If you want a histogram, you don't need to attach any 'names' to x-values because:

  • the x-axis shows the data bins
  • the y-axis shows counts (by default) or probability densities (density=True)
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline  # IPython/Jupyter magic; omit this line in a plain Python script

np.random.seed(42)
x = np.random.normal(size=1000)

plt.hist(x, density=True, bins=30)  # density=False would make counts
plt.ylabel('Probability')
plt.xlabel('Data');

[Figure: density histogram of x with 30 bins]
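To make the difference between counts and density=True concrete, here is a minimal sketch using np.histogram, which performs the same binning without drawing anything. The generator seed and the variable names are assumptions chosen for illustration, not part of the answer above.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed; any seed works
x = rng.normal(size=1000)

# Same binning twice: raw counts and normalized densities
counts, edges = np.histogram(x, bins=30)
density, _ = np.histogram(x, bins=30, density=True)
widths = np.diff(edges)

# A density is the count divided by (total count * bin width)
manual_density = counts / (counts.sum() * widths)
print(np.allclose(density, manual_density))

# The bar areas (not the bar heights) sum to 1
total_area = np.sum(density * widths)
print(total_area)
```

Because the heights are densities, an individual bar can be taller than 1 when the bins are narrow; only the total area is constrained.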

Note that bins=30 was chosen arbitrarily. The Freedman–Diaconis rule is a more principled way of choosing the bin width:

$$\text{bin width} = 2 \, \frac{\mathrm{IQR}(x)}{\sqrt[3]{n}}$$

where IQR is the interquartile range and n is the total number of data points to plot.

According to this rule, the number of bins can be calculated as:

q25, q75 = np.percentile(x, [25, 75])
bin_width = 2 * (q75 - q25) * len(x) ** (-1/3)
bins = round((x.max() - x.min()) / bin_width)
print("Freedman–Diaconis number of bins:", bins)
plt.hist(x, bins=bins);

Freedman–Diaconis number of bins: 82

[Figure: histogram of x with the Freedman–Diaconis number of bins]
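As a cross-check, NumPy ships the same rule as a named binning strategy. The sketch below compares the manual calculation with np.histogram_bin_edges(x, bins="fd"); the variable names are illustrative. NumPy rounds the bin count up, whereas the code above uses round(), so the two results may differ by one.

```python
import numpy as np

np.random.seed(42)
x = np.random.normal(size=1000)

# Manual Freedman-Diaconis calculation, as in the answer
q25, q75 = np.percentile(x, [25, 75])
bin_width = 2 * (q75 - q25) * len(x) ** (-1 / 3)
manual_bins = round((x.max() - x.min()) / bin_width)

# NumPy's built-in "fd" strategy returns the bin edges
fd_edges = np.histogram_bin_edges(x, bins="fd")
numpy_bins = len(fd_edges) - 1

print(manual_bins, numpy_bins)
```

The same string can be passed straight to the plotting call, i.e. plt.hist(x, bins="fd"), since matplotlib accepts the binning strategies supported by NumPy.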

Finally, you can make your histogram a bit fancier with a PDF line (here a kernel density estimate), titles, and a legend:

import scipy.stats as st

plt.hist(x, density=True, bins=82, label="Data")
mn, mx = plt.xlim()
plt.xlim(mn, mx)
kde_xs = np.linspace(mn, mx, 300)
kde = st.gaussian_kde(x)
plt.plot(kde_xs, kde.pdf(kde_xs), label="PDF")
plt.legend(loc="upper left")
plt.ylabel("Probability")
plt.xlabel("Data")
plt.title("Histogram");

[Figure: density histogram with the estimated PDF line, title, and legend]

If you're willing to explore other options, there is a shortcut with seaborn:

# !pip install seaborn
import seaborn as sns
sns.displot(x, bins=82, kde=True);

[Figure: seaborn histogram with KDE line]

Now back to the OP.

If you have a limited number of data points, a bar plot makes more sense for representing your data. You can then attach labels to the x-axis:

x = np.arange(3)
plt.bar(x, height=[1,2,3])
plt.xticks(x, ['a','b','c']);

[Figure: bar plot with three bars labelled a, b, c]

Thicken answered 18/10, 2015 at 22:14 Comment(3)
@Toad22222 This is an excerpt from Ipython notebook cell. Try to execute it without semicolon and see the difference. All the code snippets I post on SO run perfectly on my computer.Thicken
If you are wondering about the semi-colon used by Sergey, see here and #16 here for how semi-colon is used in Jupyter notebooks (formerly IPython notebooks) cells when plotting to suppress the text about the plot object.Campanulate
If you are getting OverflowError: cannot convert float infinity to integer just change .25 to 25 and .75 to 75Womanizer
T
27

If you haven't installed matplotlib yet, install it with the following command:

> pip install matplotlib

Library import

import matplotlib.pyplot as plot

The histogram data:

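# weightList is assumed to be your own list of numeric weights (it is not defined in this answer)
plot.hist(weightList, density=True, bins=20)
plot.axis([50, 110, 0, 0.06])  # axis([xmin, xmax, ymin, ymax])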
plot.xlabel('Weight')
plot.ylabel('Probability')

Display histogram

plot.show()

The output looks like this:

[Figure: density histogram of the weights]
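The snippets above rely on a weightList that the answer never defines. A self-contained sketch might look like the following, where the synthetic weights (mean 80, standard deviation 8) are purely hypothetical stand-in data and the Agg backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plot
import numpy as np

# Hypothetical stand-in for the undefined weightList
rng = np.random.default_rng(0)
weightList = rng.normal(loc=80, scale=8, size=500)

# density=True normalizes the bars so that their total area is 1
heights, edges, _ = plot.hist(weightList, density=True, bins=20)
plot.xlabel('Weight')
plot.ylabel('Probability density')

total_area = (heights * np.diff(edges)).sum()
print(total_area)
# In an interactive session, call plot.show() here to display the figure
```

This version leaves out the hard-coded plot.axis call so that matplotlib picks limits that fit whatever data is supplied.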

Tokharian answered 21/5, 2018 at 1:43 Comment(1)
The plot.axis([50, 110, 0, 0.06]) line is unnecessary for the example. Besides, because it hard-codes the area of the plot to show, you may be confused about why the plot doesn't display correctly if your data does not fit entirely inside it.Stay
C
7

Although the question asks for a histogram drawn with matplotlib.pyplot.hist(), that function is arguably the wrong tool here: the latter part of the question wants the given probabilities used as the bar heights (y-values) and the given names (strings) as the x-values.

I'm assuming a sample list of names corresponding to the given probabilities. A simple bar plot serves the purpose here, and the following code can be used:

import matplotlib.pyplot as plt
probability = [0.3602150537634409, 0.42028985507246375, 
  0.373117033603708, 0.36813186813186816, 0.32517482517482516, 
  0.4175257731958763, 0.41025641025641024, 0.39408866995073893, 
  0.4143222506393862, 0.34, 0.391025641025641, 0.3130841121495327, 
  0.35398230088495575]
names = ['name1', 'name2', 'name3', 'name4', 'name5', 'name6', 'name7', 'name8', 'name9',
'name10', 'name11', 'name12', 'name13'] #sample names
plt.bar(names, probability)
plt.xticks(names)
plt.yticks(probability) #This may be included or excluded as per need
plt.xlabel('Names')
plt.ylabel('Probability')
Campman answered 22/2, 2020 at 12:42 Comment(0)
M
6

This is an old question, but none of the previous answers has addressed the real issue, i.e. the fact that the problem is with the question itself.

First, if the probabilities have already been calculated, i.e. the aggregated histogram data is available in normalized form, then the probabilities should add up to 1. They obviously do not, which means that something is wrong here: with the terminology, with the data, or with the way the question is asked.
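The claim is easy to check. The sketch below sums the probability list that is quoted in the code further down in this answer; the variable name total is just for illustration.

```python
probability = [0.3602150537634409, 0.42028985507246375,
               0.373117033603708, 0.36813186813186816, 0.32517482517482516,
               0.4175257731958763, 0.41025641025641024, 0.39408866995073893,
               0.4143222506393862, 0.34, 0.391025641025641, 0.3130841121495327,
               0.35398230088495575]

# A normalized histogram would give a total of 1
total = sum(probability)
print(f"sum of probabilities = {total:.2f}")  # about 4.88, far from 1
```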

Second, the fact that labels are provided (and not intervals) would normally mean that the probabilities belong to a categorical response variable, in which case a bar plot is the best way to draw the histogram (or some hacking of pyplot's hist method); Shayan Shafiq's answer provides the code.

However, see the first issue: those probabilities are not correct, so using a bar plot as a "histogram" here would be wrong, because for some reason it does not tell the story of a univariate distribution (perhaps the classes overlap and observations are counted multiple times?). Such a plot should not be called a histogram in this case.

A histogram is by definition a graphical representation of the distribution of a univariate variable (see Histogram | NIST/SEMATECH e-Handbook of Statistical Methods & Histogram | Wikipedia). It is created by drawing bars whose sizes represent the counts or frequencies of observations in selected classes of the variable of interest. If the variable is measured on a continuous scale, those classes are bins (intervals). An important part of creating a histogram is choosing how to group (or keep ungrouped) the response categories of a categorical variable, or how to split the domain of possible values into intervals (where to put the bin boundaries) for a continuous variable. All observations should be represented in the plot, each exactly once. That means the bar sizes should sum to the total count of observations (or the bar areas should, in the less common case of variable bin widths). If the histogram is normalised, all probabilities must add up to 1.
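These invariants can be checked numerically. The following is a small sketch on hypothetical synthetic data (the sample, the seed, and the quantile-based uneven bin edges are all assumptions made for illustration), using np.histogram for the binning.

```python
import numpy as np

rng = np.random.default_rng(1)
observations = rng.normal(size=200)  # hypothetical sample

# Equal-width bins: every observation is counted exactly once
counts, edges = np.histogram(observations, bins=10)
print(counts.sum() == len(observations))

# Normalized version: relative frequencies add up to 1
rel_freq = counts / counts.sum()
print(np.isclose(rel_freq.sum(), 1.0))

# Variable-width bins: the bar areas, not the heights, add up to 1
uneven_edges = np.quantile(observations, [0, 0.1, 0.5, 0.9, 1])
density, _ = np.histogram(observations, bins=uneven_edges, density=True)
area = (density * np.diff(uneven_edges)).sum()
print(np.isclose(area, 1.0))
```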

If the data itself is a list of "probabilities" as a response, i.e. the observations are probability values (of something) for each object of study, then the best answer is simply plt.hist(probability), perhaps with a binning option, and the availability of ready-made x-labels is suspicious.

In that case a bar plot should not be used as a histogram; use simply

import matplotlib.pyplot as plt
probability = [0.3602150537634409, 0.42028985507246375, 
  0.373117033603708, 0.36813186813186816, 0.32517482517482516, 
  0.4175257731958763, 0.41025641025641024, 0.39408866995073893, 
  0.4143222506393862, 0.34, 0.391025641025641, 0.3130841121495327, 
  0.35398230088495575]
plt.hist(probability)
plt.show()

with the results

[Figure: histogram of the probability values with the default 10 bins]

In this case matplotlib produces the following histogram values by default:

(array([1., 1., 1., 1., 1., 2., 0., 2., 0., 4.]),
 array([0.31308411, 0.32380469, 0.33452526, 0.34524584, 0.35596641,
        0.36668698, 0.37740756, 0.38812813, 0.39884871, 0.40956928,
        0.42028986]),
 <a list of 10 Patch objects>)

The result is a tuple: the first array contains the observation counts, i.e. what is shown against the y-axis of the plot (they add up to 13, the total number of observations), the second array holds the interval boundaries for the x-axis, and the last element is the list of drawn bar patches.

One can check that the boundaries are equally spaced:

x = plt.hist(probability)[1]
for left, right in zip(x[:-1], x[1:]):
  print(left, right, right-left)

[Image: printed left and right bin boundaries with their differences]

Or, for example, with 3 bins (my judgment call for 13 observations) one would get this histogram:

plt.hist(probability, bins=3)

[Figure: histogram of the probability values with 3 bins]

with the plot data "behind the bars" being

[Image: counts and bin boundaries for the 3-bin histogram]
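Since the numbers themselves are not reproduced here, one way to recover the counts and bin boundaries for the 3-bin case is np.histogram, which plt.hist uses for the binning. This is a sketch for inspection only, not a reconstruction of the original image.

```python
import numpy as np

probability = [0.3602150537634409, 0.42028985507246375,
               0.373117033603708, 0.36813186813186816, 0.32517482517482516,
               0.4175257731958763, 0.41025641025641024, 0.39408866995073893,
               0.4143222506393862, 0.34, 0.391025641025641, 0.3130841121495327,
               0.35398230088495575]

# Three equal-width bins spanning min(probability) to max(probability)
counts, edges = np.histogram(probability, bins=3)
print(counts)  # [3 4 6]
print(edges)   # four boundaries, equally spaced
```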

The author of the question needs to clarify what the "probability" list of values means. Is "probability" just the name of the response variable (then why are there x-labels ready for the histogram? It makes no sense), or are the list values probabilities calculated from the data (then the fact that they do not add up to 1 makes no sense)?

Myxomatosis answered 5/6, 2020 at 22:44 Comment(1)
You NAILED it! The question is flawed. Good catch.Libbielibbna
F
5

This is a very roundabout way of doing it, but if you want to make a histogram where you already know the bin values but don't have the source data, you can use the np.random.randint function to generate the correct number of values within the range of each bin for the hist function to graph, for example:

import numpy as np
import matplotlib.pyplot as plt

data = [np.random.randint(0, 9, *desired y value*), np.random.randint(10, 19, *desired y value*), etc..]
plt.hist(data, histtype='stepfilled', bins=[0, 10, etc..])

As for labels, you can align the x ticks with the bins to get something like this:

#The following will align labels to the center of each bar with bin intervals of 10
plt.xticks([5, 15, etc.. ], ['Label 1', 'Label 2', etc.. ])
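A runnable sketch of this idea follows. The bin edges, bar heights, and labels are hypothetical values chosen for illustration, and the synthetic values are concatenated into one array so that hist treats them as a single dataset. The second half shows an alternative that avoids random data altogether by passing the known heights through the weights parameter of plt.hist.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical pre-binned data
bin_edges = np.array([0, 10, 20, 30])
bar_heights = np.array([4, 7, 2])
labels = ['Label 1', 'Label 2', 'Label 3']

# Approach from the answer: draw the right number of values inside each bin
rng = np.random.default_rng(0)
fake = np.concatenate([
    rng.integers(lo, hi, size=n)  # integers in [lo, hi)
    for lo, hi, n in zip(bin_edges[:-1], bin_edges[1:], bar_heights)
])
counts_fake, _, _ = plt.hist(fake, bins=bin_edges, histtype='stepfilled')

# Alternative: one point per bin center, weighted by the known height
centers = (bin_edges[:-1] + bin_edges[1:]) / 2
plt.figure()
counts_w, _, _ = plt.hist(centers, bins=bin_edges, weights=bar_heights,
                          histtype='stepfilled')
plt.xticks(centers, labels)  # center each label under its bar

print(counts_fake, counts_w)
```

Both calls should reproduce the given heights; the weights variant is deterministic and also works for non-integer heights.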
Ferry answered 3/5, 2016 at 2:34 Comment(0)
F
1

TL;DR: If you have raw data, you probably need hist(); if you have processed data, you probably need bar().

For a 1D array (or a flat list) of data, plt.hist is just a wrapper around np.histogram and plt.bar. In particular, since it's often the case that hist ends up drawing a lot of bars (which correspond to frequency in each bin) compared to bar, bar widths are adjusted by np.diff(bins) (source code). The main "functionality" of hist can be abbreviated as follows:

height, bins = np.histogram(data, bins)    # compute histogram
width = np.diff(bins)                      # calculate bar width
boffset = 0.5 * width                      # calculate bar position offset
plt.bar(bins[:-1]+boffset, height, width)  # plot bar-chart

So if the input is

a list of y-values that correspond to bar height

then hist most likely won't behave as you would expect it to because it bins that raw input and counts the number of data points in each bin, i.e. it would process your data even further "thinking" it's raw input. If you have a list of probability values, i.e. height of the bars in a histogram, then you can go ahead and plot a bar-chart instead.
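A tiny sketch of that pitfall, using np.histogram (which hist relies on for the binning) and three hypothetical bar heights:

```python
import numpy as np

bar_heights = [1, 2, 3]  # hypothetical heights you intended to plot

# hist would bin these values instead of drawing them as heights
counts, edges = np.histogram(bar_heights, bins=3)
print(counts)  # [1 1 1]
```

Each height falls into its own bin, so the resulting bars all have height 1 and the original values 1, 2, 3 end up on the x-axis instead of the y-axis.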


An example may be illustrative. Say you have raw data with 1000 data points.

raw_data = np.random.default_rng(0).normal(size=1000)
raw_data.shape   # (1000,)

To plot its histogram, we need to specify the number of bins (Sergey's answer includes a way to calculate the correct number of bins). Let's plot raw_data with 20 bins (which means we have a bar-chart with 20 bars).

counts, bin_edges, *_ = plt.hist(raw_data, bins=20)

[Figure: histogram of raw_data with 20 bins]

However, if you already have the counts (or frequencies or bar heights) and bin edges, like:

counts = [2, 0, 4, 3, 9, 13, 34, 68, 88, 131, 149, 128, 124, 
          95, 71, 40, 25, 9, 5, 2]

bin_edges = [-3.9, -3.55, -3.2, -2.85, -2.51, -2.16, -1.81, 
             -1.46, -1.11, -0.76, -0.42, -0.07, 0.28, 0.63, 
             0.98, 1.32, 1.67, 2.02, 2.37, 2.72, 3.07]

then use bar instead of hist. Simply plotting with plt.bar(bin_edges[:-1], counts) works if there are very few bars, i.e. the number of bins is low. With many bars, however, this would not produce an accurate histogram. We need to adjust the bar widths (as in the source code) to create a bar chart that matches the plt.hist call on the raw data:

width = np.diff(bin_edges)                           # bar widths
boffset = 0.5 * width                                # bar position offsets
plt.bar(bin_edges[:-1]+boffset, counts, width)       # bar-chart

It's left to the reader to verify that this plt.bar call with the adjusted bar widths creates the same figure as created by the plt.hist call (on the raw data) above.
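For readers who would rather check this programmatically, the sketch below draws both versions on separate axes and compares the geometry of the resulting rectangles. The helper name geometry and the use of the Agg backend are assumptions made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

raw_data = np.random.default_rng(0).normal(size=1000)

fig, (ax_hist, ax_bar) = plt.subplots(1, 2)

# Left: histogram computed and drawn by hist
counts, bin_edges, hist_patches = ax_hist.hist(raw_data, bins=20)

# Right: the same bars drawn manually from counts and bin edges
width = np.diff(bin_edges)
bar_patches = ax_bar.bar(bin_edges[:-1] + 0.5 * width, counts, width)

def geometry(patches):
    """Collect (left edge, width, height) of every rectangle."""
    return np.array([[p.get_x(), p.get_width(), p.get_height()]
                     for p in patches])

same = np.allclose(geometry(hist_patches), geometry(bar_patches))
print(same)
```

If only the outline matters, recent matplotlib versions (3.4 and later) also provide plt.stairs(counts, bin_edges, fill=True), which takes pre-binned counts and edges directly.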

Frons answered 5/3 at 0:25 Comment(0)
