Python: How to find the n-th quantile of a 2-d distribution of points
Asked Answered
D

1

5

I have a 2D-distribution of points (roughly speaking, two np.arrays, x and y) as shown in the figure attached.

How can I select the points of the distribution that are part of the n-th quantile of such distribution?

2d-distribution of points

Dicot answered 30/3, 2018 at 8:44 Comment(3)
In dimensions > 1 the quantile is not uniquely defined. A 1D quantile is simply a threshold, but in 2D it could be any closed curve that divides the data set into an 'inside' and an 'outside' region. If you had a multivariate normal distribution, a sensible choice for this curve would be an ellipse centered on the median/mean. But for the data you show I don't see any obvious shape...Shipyard
Personally, I would define the 2D quantile as the minimal convex (for tractability) area that includes x% of the data points. No idea if this definition is in any way correct or useful, and finding it is probably NP-hard.Shipyard
ncbi.nlm.nih.gov/pmc/articles/PMC4659409Vinegary
D
9

I finally came out with a solution, which doesn't look like the most elegant possible, but it worked reasonably well: To estimate quantiles of a 2 dimensional distribution one can use the scipy function binned_statistics which allows to bin the data in one of the and calculate some statistics in the other. Here is the documentation of such function: https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.binned_statistic.html Which syntax is: scipy.stats.binned_statistic(x, values, statistic='mean', bins=10, range=None)

First, one might chose the number of bins to use, for example Nbins=100. Next, one might define an user function to be put as input (here is an example on how to do so: How to make user defined functions for binned_statistic), which is my case is an function that estimates the n-th percentile of the data in that bin (I called it myperc). Finally a function is defined such as it takes x, y, Nbins and nth (the percentile desired) and returns the binned_statistics gives 3 outputs: statistic (value of the desired statistics in that bin),bin_edges,binnumber (in which bin your data-point is), but also the values of x in center of the bin (bin_center)

def quantile2d(x,y,Nbins,nth):
    from numpy import percentile
    from scipy.stats import binned_statistic
    def myperc(x,n=nth):
        return(percentile(x,n))
    t=binned_statistic(x,y,statistic=myperc,bins=Nbins)
    v=[]
    for i in range(len(t[0])): v.append((t[1][i+1]+t[1][i])/2.)
    v=np.array(v)
    return(t,v)

So v and t.statistic will give x and y values for a curve defining the desired percentile respectively.

Nbins=100
nth=30.
t,v=me.quantile2d(x,y,Nbins,nth)
ii=[]
for i in range(Nbins):
    ii=ii+np.argwhere(((t.binnumber==i) & (y<t.statistic[i]))).flatten().tolist()
ii=np.array(ii,dtype=int)

Finally, this gives the following plot:

plt.plot(x,y,'o',color='gray',ms=1,zorder=1)
plt.plot(v,t.statistic,'r-',zorder=3)
plt.plot(x[ii],y[ii],'o',color='blue',ms=1,zorder=2)

in which the line for the 30-th percentile is shown in red, and the data under this percentile is shown in blue.

Dicot answered 6/4, 2018 at 15:47 Comment(2)
Correct me if I'm wrong, but this solution seems to return the conditional quantiles/percentiles of the distribution (i.e. the n'th quantile of y based on the binned value of x). This is not the same thing as the multivariate quantile in n-dimensions. Perhaps you should consider including the term "conditional quantile" in your question/answer to avoid confusion?Joeyjoffre
I agree that the answer is different to the question. To add to your answer, you might want to consider looking are local quantile regression. It is possible that you won't find a package for that but the math behind is fairly easy :)Soutine

© 2022 - 2024 — McMap. All rights reserved.