ECDF in python without step function?

I have been using ECDF (empirical cumulative distribution function) from statsmodels.distributions to plot a CDF of some data. However, ECDF uses a step function and as a consequence I get jagged-looking plots.

So my question is: do scipy or statsmodels have an ECDF baked in without a step function?

By the way, I know I can do this:

hist, bin_edges = np.histogram(b_oz, density=True)
plt.plot(np.cumsum(hist))

but I don't get the right scales.

Thanks!

Durfee answered 22/12, 2012 at 20:56 Comment(5)
If you are worried about the data itself, a nice sanity check is R's ecdf function. If you're comfortable with R, pull the data into R and run "plot(ecdf(your_data))", which should give you a reliable picture. - Breakup
The ECDF is by definition a step function, reflecting the actual data. None of the plotted functions seems to be a "true" ECDF. To say "ECDF without a step function" seems to be a contradiction in terms. - Pilot
This question is really old, but I think I meant to describe (or approximate) the true cumulative distribution function, which is not composed of step functions. - Durfee
You could just integrate a kernel density estimate to get the desired result. - Bergquist
It's not baked in, but the one-liner in this answer does what you want. - Gilgai

If you just want to change the plot, then you could let matplotlib interpolate between the observed values.

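>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import statsmodels.api as sm
>>> nobs = 100  # sample size; not given in the original, any value works
>>> xx = np.random.randn(nobs)
>>> ecdf = sm.distributions.ECDF(xx)
>>> plt.plot(ecdf.x, ecdf.y)
[<matplotlib.lines.Line2D object at 0x07A872D0>]
>>> plt.show()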

or sort the original data and plot

>>> xx.sort()
>>> plt.plot(xx, ecdf(xx))
[<matplotlib.lines.Line2D object at 0x07A87090>]
>>> plt.show()

which is the same as plotting it directly

>>> a=0; plt.plot(xx, np.arange(1.,nobs+1)/(nobs+a))
[<matplotlib.lines.Line2D object at 0x07A87D30>]
>>> plt.show()

Note: depending on how you want the ecdf to behave at the boundaries and how it should be centered, several different normalizations for "plotting positions" are in common use. The parameter a that I added is one example; a=1 is a common choice.
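To make the effect of a concrete, here is a minimal sketch of the same normalization as a small function. The helper name plotting_positions and the random sample are my own for illustration; this is not a statsmodels function.

```python
import numpy as np

def plotting_positions(data, a=0.0):
    """Return the sorted data and cumulative probabilities i / (n + a), i = 1..n.

    Hypothetical helper for illustration only.
    a=0 gives the usual ECDF heights (last value is exactly 1);
    a=1 keeps the last value strictly below 1.
    """
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    p = np.arange(1.0, n + 1) / (n + a)
    return x, p

# Assumed example data: 200 standard normal draws
rng = np.random.default_rng(0)
sample = rng.standard_normal(200)

x0, p0 = plotting_positions(sample, a=0)  # ends at 1
x1, p1 = plotting_positions(sample, a=1)  # ends at n / (n + 1)
```

Plotting plt.plot(x0, p0) and plt.plot(x1, p1) on the same axes shows that the two curves differ mainly near the upper boundary.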

As an alternative to using the empirical cdf, you could also use an interpolated or smoothed ecdf or histogram, or a kernel density estimate.
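A rough sketch of those three alternatives, under my own assumptions: the helper names (interpolated_ecdf, histogram_cdf, kde_cdf), the bin count and the rounded sample are made up for illustration, and the KDE uses scipy's default bandwidth. Treat it as a starting point rather than the one right way to do it.

```python
import numpy as np
from scipy.stats import gaussian_kde

def interpolated_ecdf(data):
    """Piecewise-linear CDF through the ECDF points (illustrative helper).

    Repeated observations are collapsed into one point that carries the
    cumulative fraction of values <= that point.
    """
    x, counts = np.unique(np.asarray(data, dtype=float), return_counts=True)
    p = np.cumsum(counts) / counts.sum()
    # Below the smallest observation the CDF is 0, above the largest it is 1.
    return lambda q: np.interp(q, x, p, left=0.0, right=1.0)

def histogram_cdf(data, bins=20):
    """CDF at the bin edges from a density histogram (illustrative helper)."""
    density, edges = np.histogram(data, bins=bins, density=True)
    # The density integrates to 1, so weight by bin width before accumulating.
    cdf = np.concatenate([[0.0], np.cumsum(density * np.diff(edges))])
    return edges, cdf

def kde_cdf(data, grid):
    """Smooth CDF: integrate a Gaussian KDE from -inf up to each grid point."""
    kde = gaussian_kde(np.asarray(data, dtype=float))
    return np.array([kde.integrate_box_1d(-np.inf, g) for g in grid])

# Assumed example data: rounded normal draws, so many values repeat
rng = np.random.default_rng(0)
data = np.round(rng.standard_normal(500), 1)
grid = np.linspace(data.min(), data.max(), 50)

linear = interpolated_ecdf(data)(grid)
edges, hist_cdf = histogram_cdf(data)
smooth = kde_cdf(data, grid)
```

Each result can be plotted against its own x values, for example plt.plot(grid, linear), plt.plot(edges, hist_cdf) and plt.plot(grid, smooth). Plotting the histogram version against the bin edges, with the bin widths included in the sum, is what puts both axes on the right scale.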

Clemmy answered 23/12, 2012 at 1:0 Comment(3)
Yes, the problem here is that the data is not as varied as a sample created using randn(), so I still get a jagged plot because the distribution applies a step function between values. Therefore, even when I use ecdf.x and ecdf.y (by the way, nice tip... I didn't know I could do that), I get exactly the same result (with 9000+ data points). - Durfee
ECDF applies the step function only in between the original observed points; by the definition of the ecdf, the value at any point other than an observed point is given by the step function. If the plot looks step-like even when you plot only the original points, maybe your original data is binned. If you want a non-step cdf, then instead of the ecdf you could use a linear interpolation of the ecdf points (observations), which would correspond to a piecewise constant density as in a histogram. - Clemmy
Interesting. I think the original data has a lot of repeated points and is not as well distributed as randn(). Yes, I will take a look at interpolating the ecdf points. Thanks. - Durfee
