Python Empirical distribution function (ecdf) implementation
Asked Answered
M

1

7

I am aware of statsmodels.tools.tools.ECDF but since the calculation of an empricial cumulative distribution function (ECDF) is pretty straight-forward and I want to minimise dependencies in my project, I want to code it manually.

In a given list() / np.array() Pandas.Series, the ECDF for each element can be calculated as given in Wikipedia:

enter image description here

I have the Pandas DataFrame ,dfser, below and I want to get the ecdf of the values column. My two one-liner solutions are given as well.

Is there a faster way to do this? Speed is important in my application.

# Note that in my case indices are unique identifiers so I cannot reset them.
import numpy as np
import pandas as pd

# all indices are unique, but there may be duplicate measurement values (that belong to different indices). 
dfser = pd.DataFrame({'group':['a','b','b','a','d','c','e','e','c','a','b','d','d','c','d','e','e','a'],
                      'values':[2.01899E-06, 1.12186E-07, 8.97467E-07, 2.91257E-06, 1.93733E-05, 
                                0.00017889, 0.000120963, 4.27643E-07, 3.33614E-07, 2.08352E-12,  
                                1.39478E-05, 4.28255E-08, 9.7619E-06, 8.51787E-09, 1.28344E-09, 
                                3.5063E-05, 0.01732035,2.08352E-12]},
                       index = [123, 532, 235, 645, 747, 856, 345, 245, 845, 248, 901, 712, 162, 126, 
                              198,748, 127,395]      )

# My 1st Solution - list comprehension
dfser['ecdf']=[sum( dfser['values'] <= x)/float(dfser['values'].size) for x in dfser['values']]

# My 2nd Solution - ranking
dfser['rank'] = dfser['values'].rank(ascending = 0)
dfser['ecdf_r']=(len(dfser)-dfser['rank']+1)/len(dfser)
dfser
    group        values      ecdf  rank    ecdf_r
123     a  2.018990e-06  0.555556   9.0  0.555556
532     b  1.121860e-07  0.333333  13.0  0.333333
235     b  8.974670e-07  0.500000  10.0  0.500000
645     a  2.912570e-06  0.611111   8.0  0.611111
747     d  1.937330e-05  0.777778   5.0  0.777778
856     c  1.788900e-04  0.944444   2.0  0.944444
345     e  1.209630e-04  0.888889   3.0  0.888889
245     e  4.276430e-07  0.444444  11.0  0.444444
845     c  3.336140e-07  0.388889  12.0  0.388889
248     a  2.083520e-12  0.111111  17.5  0.083333
901     b  1.394780e-05  0.722222   6.0  0.722222
712     d  4.282550e-08  0.277778  14.0  0.277778
162     d  9.761900e-06  0.666667   7.0  0.666667
126     c  8.517870e-09  0.222222  15.0  0.222222
198     d  1.283440e-09  0.166667  16.0  0.166667
748     e  3.506300e-05  0.833333   4.0  0.833333
127     e  1.732035e-02  1.000000   1.0  1.000000
395     a  2.083520e-12  0.111111  17.5  0.083333
Milden answered 19/2, 2014 at 17:16 Comment(1)
So this is just quick because I don't have a lot of time to give you a full answer, but something like: np.arange(1, ser.size+1)/float(ser.size) will give you the same cdf that you calculate.Fervent
L
13

Since you are already using pandas I think it will be silly not to use some of its features:

In [15]:
import numpy as np
from numpy import *
sq=ser.value_counts()
sq.sort_index().cumsum()*1./len(sq)
Out[15]:
2.083520e-12    0.058824
1.283440e-09    0.117647
8.517870e-09    0.176471
4.282550e-08    0.235294
1.121860e-07    0.294118
3.336140e-07    0.352941
4.276430e-07    0.411765
8.974670e-07    0.470588
2.018990e-06    0.529412
2.912570e-06    0.588235
9.761900e-06    0.647059
1.394780e-05    0.705882
1.937330e-05    0.764706
3.506300e-05    0.823529
1.209630e-04    0.882353
1.788900e-04    0.941176
1.732035e-02    1.000000
dtype: float64

And speed comparison

In [19]:

%timeit sq.sort_index().cumsum()*1./len(sq)
1000 loops, best of 3: 344 µs per loop
In [18]:

%timeit ser.value_counts().sort_index().cumsum()*1./len(ser.value_counts())
1000 loops, best of 3: 1.58 ms per loop
In [17]:

%timeit [sum( ser <= x)/float(len(ser)) for x in ser]
100 loops, best of 3: 3.31 ms per loop

If the values are all unique, the ser.value_counts() is no longer needed. That part is slow (Fetching unique values). All you need in that case is just to sort ser.

In [23]:

%timeit np.arange(1, ser.size+1)/float(ser.size)
10000 loops, best of 3: 11.6 µs per loop

The fastest version that I can think of is to use get vectorized:

In [35]:

np.sum(dfser['values'].values[...,newaxis]<=dfser['values'].values.reshape((1,-1)), axis=0)*1./dfser['values'].size
Out[35]:
array([ 0.55555556,  0.33333333,  0.5       ,  0.61111111,  0.77777778,
        0.94444444,  0.88888889,  0.44444444,  0.38888889,  0.11111111,
        0.72222222,  0.27777778,  0.66666667,  0.22222222,  0.16666667,
        0.83333333,  1.        ,  0.11111111])

Add let see:

In [37]:

%timeit dfser['ecdf']=[sum( dfser['values'] <= x)/float(dfser['values'].size) for x in dfser['values']]
100 loops, best of 3: 6 ms per loop
In [38]:

%%timeit
dfser['rank'] = dfser['values'].rank(ascending = 0)
dfser['ecdf_r']=(len(dfser)-dfser['rank']+1)/len(dfser)
1000 loops, best of 3: 827 µs per loop
In [39]:

%timeit np.sum(dfser['values'].values[...,newaxis]<=dfser['values'].values.reshape((1,-1)), axis=0)*1./dfser['values'].size
10000 loops, best of 3: 91.1 µs per loop
Lout answered 19/2, 2014 at 17:51 Comment(7)
Thank you @CT Zhu, I updated my question, can you have a look again? I changed the input to a DataFrame from a Series. Also I included some duplicate values. The crux is that, all indices are unique but the values can have duplicate values. I still want to get the ECDF of the values Series under he dataframe, but sorting it should affect the entire dataframe.Milden
You are welcome, and check out the new edit. Also I think your ecdf_rgives a slightly different result. Using the fully vectorized version may not be the best choice in production code (readability issue). But I think you are in a research setting right?Lout
Thank you again @CT Zhu, I will accept the answer, but the code does not work. What is dfser['values'].values[...,newaxis]? I don't understand what [...,newaxis] is. P.S: yes ecdf_r returns a slightly different result but I decided it is OK for the applciation since it returns the same value for duplicates.Milden
the times indicate that you have a much faster solution than my ranking one. However, I do not understand what [...,newaxis] stands for in your solution?Milden
That is not very well documented, but can be found here docs.scipy.org/doc/numpy/user/basics.broadcasting.html, in the very last few lines. Let me know it that makes sense to you.Lout
There's a bug in the first implementation. sq.sort_index().cumsum()*1./len(sq). That should be len(ser) -- the length of the original list. It works fine in this example because all the values in ser are unique. If ser were to have duplicates though, you would be dividing by an incorrect length.Distaste
@Distaste is correct, the norming of the ecdf is wrong if there are non-unique values in the data. either 1./len(ser) or 1./sq.sum() would be correct.Mele

© 2022 - 2024 — McMap. All rights reserved.