How to get the cumulative distribution function with NumPy?
S

7

43

I want to create a CDF with NumPy; my code is the following:

import numpy as np

# data: 2-D image array (width x height) with integer values below 4096
histo = np.zeros(4096, dtype=np.int32)
for x in range(0, width):
    for y in range(0, height):
        histo[data[x][y]] += 1

q = 0
cdf = list()
for i in histo:
    q = q + i
    cdf.append(q)

I am walking through the array, but the program execution takes a long time. Is there a built-in function that does this?

Samhita answered 17/5, 2012 at 17:44 Comment(1)
Also see empirical distribution functionPickford
P
27

I'm not really sure what your code is doing, but if you have hist and bin_edges arrays returned by numpy.histogram you can use numpy.cumsum to generate a cumulative sum of the histogram contents.

>>> import numpy as np
>>> hist, bin_edges = np.histogram(np.random.randint(0,10,100), normed=True)
>>> bin_edges
array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
>>> hist
array([ 0.14444444,  0.11111111,  0.11111111,  0.1       ,  0.1       ,
        0.14444444,  0.14444444,  0.08888889,  0.03333333,  0.13333333])
>>> np.cumsum(hist)
array([ 0.14444444,  0.25555556,  0.36666667,  0.46666667,  0.56666667,
        0.71111111,  0.85555556,  0.94444444,  0.97777778,  1.11111111])
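
Note that the cumulative sum above ends near 1.11 rather than at 1: normed gives a density rather than per-bin probabilities, so each bin needs to be weighted by its width before summing. On newer NumPy releases, where normed has been removed in favour of density, a minimal sketch of the same idea is:

import numpy as np

hist, bin_edges = np.histogram(np.random.randint(0, 10, 100), density=True)
cdf = np.cumsum(hist * np.diff(bin_edges))  # weight each density value by its bin width
# cdf now ends at 1.0 (up to floating point)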
Pungent answered 17/5, 2012 at 19:15 Comment(5)
However, this introduces a binning step that would not be necessary for a cumulative distribution.Inexcusable
"This keyword, normed is deprecated in Numpy 1.6 due to confusing/buggy behavior. It will be removed in Numpy 2.0. "There is a bug in the code if bin is not in [0,1]. Add x=np.cumsum(hist); x=(x - x.min()) / x.ptp()Pall
@Inexcusable Exactly. Any better solution for this?Hadria
@becko Dan's reply above contains both a histogram-based and an "exact" solution ("method 2").Inexcusable
Ah yes I missed that. Thanks @InexcusableHadria
C
107

Using a histogram is one solution, but it involves binning the data, which is not necessary for plotting a CDF of empirical data. Let F(x) be the count of entries less than x; it goes up by exactly one at each point where we see a measurement. So if we sort our samples, increment the count by one at each point (or the fraction by 1/N), and plot one against the other, we see the "exact" (i.e. un-binned) empirical CDF.

The following code sample demonstrates the method:

import numpy as np
import matplotlib.pyplot as plt

N = 100
Z = np.random.normal(size = N)
# method 1
H,X1 = np.histogram( Z, bins = 10, normed = True )
dx = X1[1] - X1[0]
F1 = np.cumsum(H)*dx
#method 2
X2 = np.sort(Z)
F2 = np.array(range(N))/float(N)

plt.plot(X1[1:], F1)
plt.plot(X2, F2)
plt.show()

It outputs a plot showing both curves: the binned CDF from method 1 and the exact (un-binned) empirical CDF from method 2.

Clemmer answered 26/5, 2015 at 13:33 Comment(2)
as per numpy.histogram docs : normed is equivalent to the density argument, but produces incorrect results for unequal bin widths. Changed in version 1.15.0: DeprecationWarnings are actually emitted.Solitary
How would you deal with exact duplicate values in Z?Pickford
B
7

Update for NumPy version 1.9.0: user545424's answer does not work in 1.9.0. This works:

>>> import numpy as np
>>> arr = np.random.randint(0,10,100)
>>> hist, bin_edges = np.histogram(arr, density=True)
>>> hist
array([ 0.16666667,  0.15555556,  0.15555556,  0.05555556,  0.08888889,
        0.08888889,  0.07777778,  0.04444444,  0.18888889,  0.08888889])
>>> bin_edges
array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
>>> np.diff(bin_edges)
array([ 0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9])
>>> np.diff(bin_edges)*hist
array([ 0.15,  0.14,  0.14,  0.05,  0.08,  0.08,  0.07,  0.04,  0.17,  0.08])
>>> cdf = np.cumsum(hist*np.diff(bin_edges))
>>> cdf
array([ 0.15,  0.29,  0.43,  0.48,  0.56,  0.64,  0.71,  0.75,  0.92,  1.  ])
>>>
Biogeography answered 21/11, 2014 at 18:48 Comment(1)
user12287, I feel weird editing other people's answers. Besides, different answers for different versions.Biogeography
E
5

To complement Dan's solution: in the case where there are several identical values in your sample, you can use numpy.unique:

Z = np.array([1,1,1,2,2,4,5,6,6,6,7,8,8])
X, F = np.unique(Z, return_index=True)
F=F/X.size

plt.plot(X, F)
Eberto answered 26/8, 2015 at 15:8 Comment(4)
That gives you values of F that are greater than 1. Perhaps you meant to use F = F / float(F.max()) (also bear in mind that integer division would cause problems for people using Python 2x).Gametophore
This answer is old, thank you for your comments and answers. I have seen in each answer my rudimentary approach from three years ago.Samhita
@Eberto This is not quite correct, since the eCDF should go up by more than 1/N at entries that occur more than once. You're right that my solution is only correct at the last of such occurrences, but it will plot correctly.Clemmer
In principle you are using counts, but python uses zero based indexing in F, so perhaps you meant (F + 1) / (F[-1] + 1)Cayuga
F
4

The existing answers either resort to using a histogram, or don't handle duplicate values nicely/correctly (either ignoring duplicate values, or yielding a CDF that contains multiple y-values for the same x-value). I suggest the following method:

x, CDF_counts = np.unique(data, return_counts = True)
y = np.cumsum(CDF_counts)/np.sum(CDF_counts)
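
For example, the result can be plotted directly as a right-continuous step function (a small sketch; the data array below is just a placeholder sample):

import numpy as np
import matplotlib.pyplot as plt

data = np.array([1, 1, 1, 2, 2, 4, 5, 6, 6, 6, 7, 8, 8])  # placeholder sample

x, CDF_counts = np.unique(data, return_counts=True)
y = np.cumsum(CDF_counts) / np.sum(CDF_counts)

plt.step(x, y, where='post')  # the eCDF jumps at each distinct value
plt.ylim(0, 1.05)
plt.show()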
Ferminafermion answered 23/2, 2023 at 14:6 Comment(0)
M
1

A minor improvement to @Dan's "exact" method 2. I believe the eCDF should not be 0 at the first observation and should be 1 at the last one, and eCDFs are often visualized as step functions (all three points are mostly irrelevant for large n).

There was an unanswered question about duplicates; matplotlib visualizes them fine, but here is a way to remove them:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([3, 3, 3.5, 4, 6, 0, 0.5, 1, 1, 2, 2.5])
# x = np.random.normal(size = 100)

x = np.sort(x)
n = x.shape[0]

# original
y = np.arange(n)/n
plt.plot(x, y, label='original')
plt.plot(x, y, '.', color='tab:red', label='original')

# step (0, 1]
y_step = np.arange(1,n+1)/n
plt.step(x, y_step, where='post', label='step')

# no duplicates
x_unique, inds = np.unique(x, return_index=True)
y_unique = [y_[-1] for y_ in  np.split(y_step, inds[1:])]
plt.step(x_unique, y_unique, '--', where='post', label='step (unique)')
plt.plot(x_unique, y_unique, '.', color='tab:brown', label='step (unique)')

plt.ylim(-0.1, 1.1)
plt.legend()

(plot of the original, step, and duplicate-free step eCDFs)

Melitamelitopol answered 10/2, 2024 at 12:19 Comment(0)
O
-3

I am not sure if there is a ready-made function, but the exact thing to do is to define one like:

def _cdf(x, data):
    # fraction of entries in data that lie below x
    return np.sum(x > data) / data.size

Since the comparison is vectorized over data, each evaluation is fast.
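
For example, evaluating it over a grid of points (a small sketch; data here is just a placeholder sample):

import numpy as np

data = np.random.normal(size=1000)   # placeholder sample
xs = np.linspace(-3, 3, 7)           # points at which to evaluate the eCDF
print([_cdf(x, data) for x in xs])   # values rise from ~0 to ~1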

Olen answered 21/9, 2016 at 16:55 Comment(0)
