How to get the cumulative distribution function with NumPy?
S

7

43

I want to create a CDF with NumPy; my code is the following:

import numpy as np

# data: 2-D image array (width x height) with integer values below 4096
histo = np.zeros(4096, dtype=np.int32)
for x in range(0, width):
    for y in range(0, height):
        histo[data[x][y]] += 1

q = 0
cdf = list()
for i in histo:
    q = q + i
    cdf.append(q)

I am walking through the array, but the program execution takes a long time. Is there a built-in function that does this?

Samhita answered 17/5, 2012 at 17:44 Comment(1)
Also see empirical distribution functionPickford
P
27

I'm not really sure what your code is doing, but if you have hist and bin_edges arrays returned by numpy.histogram you can use numpy.cumsum to generate a cumulative sum of the histogram contents.

>>> import numpy as np
>>> hist, bin_edges = np.histogram(np.random.randint(0,10,100), normed=True)
>>> bin_edges
array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
>>> hist
array([ 0.14444444,  0.11111111,  0.11111111,  0.1       ,  0.1       ,
        0.14444444,  0.14444444,  0.08888889,  0.03333333,  0.13333333])
>>> np.cumsum(hist)
array([ 0.14444444,  0.25555556,  0.36666667,  0.46666667,  0.56666667,
        0.71111111,  0.85555556,  0.94444444,  0.97777778,  1.11111111])
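
Note that the cumulative sum above ends near 1.11 rather than at 1: normed gives a density rather than per-bin probabilities, so each bin needs to be weighted by its width before summing. On newer NumPy releases, where normed has been removed in favour of density, a minimal sketch of the same idea is:

import numpy as np

hist, bin_edges = np.histogram(np.random.randint(0, 10, 100), density=True)
cdf = np.cumsum(hist * np.diff(bin_edges))  # weight each density value by its bin width
# cdf now ends at 1.0 (up to floating point)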
Pungent answered 17/5, 2012 at 19:15 Comment(5)
However, this introduces a binning step that would not be necessary for a cumulative distribution.Inexcusable
"This keyword, normed is deprecated in Numpy 1.6 due to confusing/buggy behavior. It will be removed in Numpy 2.0. "There is a bug in the code if bin is not in [0,1]. Add x=np.cumsum(hist); x=(x - x.min()) / x.ptp()Pall
@Inexcusable Exactly. Any better solution for this?Hadria
@becko Dan's reply above contains both a histogram-based and an "exact" solution ("method 2").Inexcusable
Ah yes I missed that. Thanks @InexcusableHadria
C
107

Using a histogram is one solution, but it involves binning the data, which is not necessary for plotting a CDF of empirical data. Let F(x) be the count of entries less than x; it goes up by exactly one at each point where we see a measurement. So if we sort our samples, increment the count by one at each point (or the fraction by 1/N), and plot one against the other, we see the "exact" (i.e. un-binned) empirical CDF.

The following code sample demonstrates the method:

import numpy as np
import matplotlib.pyplot as plt

N = 100
Z = np.random.normal(size = N)
# method 1
H,X1 = np.histogram( Z, bins = 10, normed = True )
dx = X1[1] - X1[0]
F1 = np.cumsum(H)*dx
#method 2
X2 = np.sort(Z)
F2 = np.array(range(N))/float(N)

plt.plot(X1[1:], F1)
plt.plot(X2, F2)
plt.show()

It outputs a plot showing both curves: the binned CDF from method 1 and the exact (un-binned) empirical CDF from method 2.

Clemmer answered 26/5, 2015 at 13:33 Comment(2)
as per numpy.histogram docs : normed is equivalent to the density argument, but produces incorrect results for unequal bin widths. Changed in version 1.15.0: DeprecationWarnings are actually emitted.Solitary
How would you deal with exact duplicate values in Z?Pickford
B
7

Update for NumPy version 1.9.0: user545424's answer does not work in 1.9.0. This works:

>>> import numpy as np
>>> arr = np.random.randint(0,10,100)
>>> hist, bin_edges = np.histogram(arr, density=True)
>>> hist
array([ 0.16666667,  0.15555556,  0.15555556,  0.05555556,  0.08888889,
        0.08888889,  0.07777778,  0.04444444,  0.18888889,  0.08888889])
>>> bin_edges
array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
>>> np.diff(bin_edges)
array([ 0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9])
>>> np.diff(bin_edges)*hist
array([ 0.15,  0.14,  0.14,  0.05,  0.08,  0.08,  0.07,  0.04,  0.17,  0.08])
>>> cdf = np.cumsum(hist*np.diff(bin_edges))
>>> cdf
array([ 0.15,  0.29,  0.43,  0.48,  0.56,  0.64,  0.71,  0.75,  0.92,  1.  ])
>>>
Biogeography answered 21/11, 2014 at 18:48 Comment(1)
user12287, I feel weird editing other people's answers. Besides, different answers for different versions.Biogeography
E
5

To complement Dan's solution: in the case where there are several identical values in your sample, you can use numpy.unique:

Z = np.array([1,1,1,2,2,4,5,6,6,6,7,8,8])
X, F = np.unique(Z, return_index=True)
F=F/X.size

plt.plot(X, F)
Eberto answered 26/8, 2015 at 15:8 Comment(4)
That gives you values of F that are greater than 1. Perhaps you meant to use F = F / float(F.max()) (also bear in mind that integer division would cause problems for people using Python 2x).Gametophore
This answer is old, thank you for your comments and answers. I have seen in each answer my rudimentary approach from three years ago.Samhita
@Eberto This is not quite correct, since the eCDF should go up by more than 1/N at entries that occur more than once. You're right that my solution is only correct at the last of such occurrences, but it will plot correctly.Clemmer
In principle you are using counts, but python uses zero based indexing in F, so perhaps you meant (F + 1) / (F[-1] + 1)Cayuga
F
4

The existing answers either resort to using a histogram, or don't handle duplicate values nicely/correctly (either ignoring duplicate values, or yielding a CDF that contains multiple y-values for the same x-value). I suggest the following method:

x, CDF_counts = np.unique(data, return_counts = True)
y = np.cumsum(CDF_counts)/np.sum(CDF_counts)
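
For example, the result can be plotted directly as a right-continuous step function (a small sketch; the data array below is just a placeholder sample):

import numpy as np
import matplotlib.pyplot as plt

data = np.array([1, 1, 1, 2, 2, 4, 5, 6, 6, 6, 7, 8, 8])  # placeholder sample

x, CDF_counts = np.unique(data, return_counts=True)
y = np.cumsum(CDF_counts) / np.sum(CDF_counts)

plt.step(x, y, where='post')  # the eCDF jumps at each distinct value
plt.ylim(0, 1.05)
plt.show()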
Ferminafermion answered 23/2, 2023 at 14:6 Comment(0)
M
1

A minor improvement to @Dan's "exact" method 2. I believe the eCDF should not be 0 at the first observation and should be 1 at the last one, and eCDFs are often visualized as step functions (all three points are mostly irrelevant for large n).

There was an unanswered question about duplicates; matplotlib visualizes them fine, but here is a way to remove them:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([3, 3, 3.5, 4, 6, 0, 0.5, 1, 1, 2, 2.5])
# x = np.random.normal(size = 100)

x = np.sort(x)
n = x.shape[0]

# original
y = np.arange(n)/n
plt.plot(x, y, label='original')
plt.plot(x, y, '.', color='tab:red', label='original')

# step (0, 1]
y_step = np.arange(1,n+1)/n
plt.step(x, y_step, where='post', label='step')

# no duplicates
x_unique, inds = np.unique(x, return_index=True)
y_unique = [y_[-1] for y_ in  np.split(y_step, inds[1:])]
plt.step(x_unique, y_unique, '--', where='post', label='step (unique)')
plt.plot(x_unique, y_unique, '.', color='tab:brown', label='step (unique)')

plt.ylim(-0.1, 1.1)
plt.legend()

(plot of the original, step, and duplicate-free step eCDFs)

Melitamelitopol answered 10/2, 2024 at 12:19 Comment(0)
O
-3

I am not sure if there is a ready-made function, but the exact thing to do is to define one like:

def _cdf(x, data):
    # fraction of entries in data that lie below x
    return np.sum(x > data) / data.size

Since the comparison is vectorized over data, each evaluation is fast.
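
For example, evaluating it over a grid of points (a small sketch; data here is just a placeholder sample):

import numpy as np

data = np.random.normal(size=1000)   # placeholder sample
xs = np.linspace(-3, 3, 7)           # points at which to evaluate the eCDF
print([_cdf(x, data) for x in xs])   # values rise from ~0 to ~1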

Olen answered 21/9, 2016 at 16:55 Comment(0)
