Plotting a histogram in Pandas with very heavy-tailed data

Asked 15/8, 2014 at 0:57 Answered 17/3, 2023 at 6:57

Solved python matplotlib pandas histogram jupyter-notebook

I often work with data that has a very long tail. I want to plot histograms to summarize the distribution, but when I try using pandas I wind up with a bar graph that has one giant visible bar and everything else invisible.

Here is an example of the series I am working with. Since it's very long, I used value_counts() so it will fit on this page.

In [10]: data.value_counts().sort_index()

Out[10]:
0          8012
25         3710
100       10794
200       11718
300        2489
500        7631
600          34
700         115
1000       3099
1200       1766
1600         63
2000       1538
2200         41
2500        208
2700       2138
5000        515
5500        201
8800         10
10000        10
10900       465
13000         9
16200        74
20000       518
21500        65
27000        64
53000        82
56000         1
106000       35
530000        3

I'm guessing that the answer involves binning the less common results into larger groups somehow (53000, 56000, 106000, and 530000 into one group of >50000, etc.), and also changing the y axis to show the percentage of occurrences rather than the absolute count. However, I don't understand how I would go about doing that automatically.

Britanybritches answered 15/8, 2014 at 0:57 Comment(1)

histogram-with-colored-tails including binning – Acculturation 10/4 at 12:17

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np


mydict = {0: 8012,25: 3710,100: 10794,200: 11718,300: 2489,500: 7631,600: 34,700: 115,1000: 3099,1200: 1766,1600: 63,2000: 1538,2200: 41,2500: 208,2700: 2138,5000: 515,5500: 201,8800: 10,10000: 10,10900: 465,13000: 9,16200: 74,20000: 518,21500: 65,27000: 64,53000: 82,56000: 1,106000: 35,530000: 3}
mylist = []

# Expand the value counts back into one list entry per observation
for key, count in mydict.items():
    mylist.extend([key] * count)

df = pd.DataFrame(mylist,columns=['value'])
df2 = df[df.value <= 5000]

Plot as a bar:

# .plot() returns a matplotlib Axes, not a Figure
ax = df.value.value_counts().sort_index().plot(kind="bar")
plt.savefig("figure.png")


As a histogram, limited to values of 5000 and under (more than 97% of your data). I like using np.linspace to control the bin edges:

df2 = df[df.value <= 5000]
df2.hist(bins=np.linspace(0,5000,101))
plt.savefig('hist1')


EDIT: Changed np.linspace(0,5000,100) to np.linspace(0,5000,101) & updated histogram.
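The question also asks for an overflow group and a percentage y axis. Here is a minimal sketch of one way to do that, not part of the original answer: values at or above a cutoff are clipped into one final bin, and each observation gets a weight of 100/N so the bar heights are percentages. The cutoff (5000), the bin width (100), and the output file name are assumptions picked for illustration.

```python
import numpy as np
from matplotlib import pyplot as plt

# Value counts copied from the question
mydict = {0: 8012, 25: 3710, 100: 10794, 200: 11718, 300: 2489, 500: 7631,
          600: 34, 700: 115, 1000: 3099, 1200: 1766, 1600: 63, 2000: 1538,
          2200: 41, 2500: 208, 2700: 2138, 5000: 515, 5500: 201, 8800: 10,
          10000: 10, 10900: 465, 13000: 9, 16200: 74, 20000: 518, 21500: 65,
          27000: 64, 53000: 82, 56000: 1, 106000: 35, 530000: 3}

# One array entry per observation
values = np.repeat(list(mydict.keys()), list(mydict.values()))

cap = 5000       # assumed cutoff: everything >= cap goes into the overflow bin
bin_width = 100  # assumed bin width

# Regular edges 0, 100, ..., 5000 plus one extra edge closing the overflow bin
bin_edges = np.append(np.arange(0, cap + bin_width, bin_width), cap + bin_width)

# Pull the tail down to the last edge so it is counted in the overflow bin
clipped = np.minimum(values, cap + bin_width)

# Each observation contributes 100/N, so the heights sum to 100 percent
weights = np.full(clipped.shape, 100.0 / clipped.size)

fig, ax = plt.subplots()
heights, edges, _ = ax.hist(clipped, bins=bin_edges, weights=weights)
ax.set_xlabel("value (rightmost bar collects everything >= 5000)")
ax.set_ylabel("share of observations (%)")
fig.savefig("hist_overflow.png")
```

The rightmost bar then shows the combined share of the tail instead of hiding it, at the cost of losing the shape of the tail itself.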

Lubricator answered 15/8, 2014 at 6:12 Comment(1)

I'm not exactly sure how I didn't stumble onto just trying a plain-old-bar graph on the value_counts(). I guess I'll file this one under "trying to out-smart myself". Thanks. – Britanybritches 15/8, 2014 at 12:48

Usually, heavy-tailed distributions end in a power-law tail, as in the Pareto distribution, for example. In such cases, a log-log plot is a powerful representation. This is quite easy to implement in Python, see e.g.

Note that discarding some values may be an inefficient way to see the power-law distribution.
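As a rough sketch of the log-log idea using the data from the question (my own illustration, not a reference from this answer): the bins are spaced geometrically with np.geomspace and both axes are set to log scale. The number of bins (30) and the file name are assumptions. Zeros cannot be drawn on a log axis, so this sketch only plots the positive values, which is a limitation to keep in mind.

```python
import numpy as np
from matplotlib import pyplot as plt

# Value counts copied from the question
mydict = {0: 8012, 25: 3710, 100: 10794, 200: 11718, 300: 2489, 500: 7631,
          600: 34, 700: 115, 1000: 3099, 1200: 1766, 1600: 63, 2000: 1538,
          2200: 41, 2500: 208, 2700: 2138, 5000: 515, 5500: 201, 8800: 10,
          10000: 10, 10900: 465, 13000: 9, 16200: 74, 20000: 518, 21500: 65,
          27000: 64, 53000: 82, 56000: 1, 106000: 35, 530000: 3}

# One array entry per observation
values = np.repeat(list(mydict.keys()), list(mydict.values()))

# A log axis cannot represent 0, so keep only the positive values here
positive = values[values > 0]

# 31 geometrically spaced edges give 30 bins; the edges are padded by 1%
# on each side so the smallest and largest values fall strictly inside
bin_edges = np.geomspace(positive.min() * 0.99, positive.max() * 1.01, 31)

fig, ax = plt.subplots()
heights, edges, _ = ax.hist(positive, bins=bin_edges)
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("value")
ax.set_ylabel("count per bin")
fig.savefig("hist_loglog.png")
```

Because the bins get wider to the right, the raw counts per bin are not densities. Passing density=True to ax.hist divides by the bin width, which is usually what you want if you intend to read a power-law slope off the plot.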

Also consider a Pareto analysis of your data.
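One common reading of "Pareto analysis" is a Pareto chart: the distinct values sorted by frequency, with a cumulative percentage line on a second axis. The following is a sketch under that assumption, using the value counts from the question. The styling and the file name are arbitrary choices.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Value counts copied from the question
mydict = {0: 8012, 25: 3710, 100: 10794, 200: 11718, 300: 2489, 500: 7631,
          600: 34, 700: 115, 1000: 3099, 1200: 1766, 1600: 63, 2000: 1538,
          2200: 41, 2500: 208, 2700: 2138, 5000: 515, 5500: 201, 8800: 10,
          10000: 10, 10900: 465, 13000: 9, 16200: 74, 20000: 518, 21500: 65,
          27000: 64, 53000: 82, 56000: 1, 106000: 35, 530000: 3}

# Distinct values ordered from most to least frequent
counts = pd.Series(mydict).sort_values(ascending=False)

# Running share of all observations covered so far, in percent
cum_pct = counts.cumsum() / counts.sum() * 100

fig, ax = plt.subplots(figsize=(10, 4))
counts.plot(kind="bar", ax=ax)
ax.set_xlabel("value")
ax.set_ylabel("count")

# The bar plot places bars at positions 0..n-1, so the line uses the same x
ax2 = ax.twinx()
ax2.plot(range(len(cum_pct)), cum_pct.to_numpy(), color="red", marker="o")
ax2.set_ylim(0, 105)
ax2.set_ylabel("cumulative share (%)")

fig.tight_layout()
fig.savefig("pareto.png")
```

The cum_pct series can also be used on its own, for example (cum_pct < 80).sum() + 1 gives the number of distinct values needed to cover 80% of the observations.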

In case you're interested in power-law distributions, you can read more on the result, due to Vitold Belevitch (1959), that categorical data are inherently power-law distributed by construction, since they cannot be sorted.

Courtmartial answered 17/3, 2023 at 6:57 Comment(0)
