Plot contours for the densest region of a scatter plot

Asked 11/10, 2013 at 6:52 Answered 25/10, 2019 at 19:8

Solved python numpy matplotlib scipy contour

I am generating a scatter plot of ~300k data points, and it is so overcrowded in some places that no structure is visible. So I had a thought!

I want to have the plot generate a contour plot for the densest parts and leave the less-dense areas with the scatter() data points.

So I was trying to individually compute a nearest-neighbour distance for each of the data points and then when this distance hit a specific value, draw a contour and fill it, then when it hit a much larger value (less dense) just do the scatter...

I have been trying and failing for a few days now, and I am not sure that a conventional contour plot will work in this case.

I would supply code but it is so messy and would probably just confuse the issue. And it is so computationally intensive that it would probably just crash my pc if it did work!

Thank you all in advance!

p.s. I have been searching and searching for an answer! I am convinced it is not even possible for all the results it turned up!

Edit: So the idea of this is to see where some particular points lie within the structure of the 300k sample. Here is an example plot; my points are scattered in three different colours. [Image: my scatter version of the data]

I will attempt to randomly sample 1000 datapoints from my data and upload it as a text file. Cheers Stackers. :)

Edit: Hey, here is some sample data: 1000 lines, just two columns [X,Y] (or [g-i,i] from the plot above), space delimited. Thank you all! [Link: the data]

Acanthoid answered 11/10, 2013 at 6:52 Comment(8)

Depending on how crowded these values are, you could probably tease some structure out by just doing scatter(x, y, alpha=0.1) or some suitable small value. To do what you suggest, I would build a kernel density estimate (see scipy.stats.kde). – Gouache 11/10, 2013 at 8:5

Why don't you use a 2d histogram to show your data? – Maladminister 11/10, 2013 at 8:12

@Acanthoid you can just supply random data that is of the same type/shape/etc as your real data - you don't always need to post the complicated steps that generated the real data in the first place. Makes it easier for us to give answers that are useful to you. – Delorasdelorenzo 11/10, 2013 at 10:0

@RutgerKassies - That doesn't really display that data in a meaningful way, and is subject to binning issues. Also, it is hard to correctly represent it in a print out. – Acanthoid 11/10, 2013 at 21:51

@Acanthoid "That doesn't really display that data in a meaningful way, and is subject to binning issues. Also, it is hard to correctly represent it in a print out." - what do you mean it's not meaningful? Histograms are a totally valid way to convey where the mass of your distribution is. There's absolutely nothing to be gained by plotting the exact position of every x,y point in a cloud of 300,000, where many of them are overlapping one another anyway. It's not that hard to find a colormap that will look good on a printout. – Wilinski 12/10, 2013 at 1:7

@Wilinski I agree, that was an unfair statement. ''There's absolutely nothing to be gained by plotting the exact position of every x,y point'' This is also true, which is why I am trying to do the contour plot. I have messed around with colour maps plt.hexbin() but I do not think that they are as instantly clear as a contour plot. Nor is it as easy (for a viewer) to quantitatively determine the value of specific regions. Sorry for the misunderstanding. – Acanthoid 12/10, 2013 at 1:43

@Acanthoid So why not use np.histogram2d to make an array of bin counts, then draw them as a contour plot instead? In terms of quantification you could normalize by bin size so that your values correspond to the density of points in each bin. You could also use KDE and plot the estimated probability density function of your data, although this has a slightly different meaning to your original plot. – Wilinski 12/10, 2013 at 9:49
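The suggestion in the comment above could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name binned_density, the bin counts and the random test data are all made up here. It divides the np.histogram2d counts by the bin area so the values read as points per unit area.

```python
import numpy as np

def binned_density(x, y, bins=(50, 40)):
    """Hypothetical helper: per-bin point density (counts / bin area) plus bin centres."""
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # Area of every bin: outer product of the bin widths along x and y.
    area = np.outer(np.diff(xedges), np.diff(yedges))
    density = counts / area
    # Bin centres are the natural coordinates for a contour plot.
    xcentres = 0.5 * (xedges[:-1] + xedges[1:])
    ycentres = 0.5 * (yedges[:-1] + yedges[1:])
    return density, xcentres, ycentres

# Made-up data standing in for the real sample.
rng = np.random.default_rng(123)
x = rng.normal(size=5000)
y = rng.normal(size=5000)
density, xc, yc = binned_density(x, y)
```

To draw it, something like ax.contour(xc, yc, density.T) should work; the transpose is needed because np.histogram2d returns an array indexed as [x, y], while contour expects [y, x].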

Hey @Wilinski thanks for the input! That suggestion seems pretty solid, I will sit down and think it through properly when I have more time. I have settled for a 2d histogram for now, but I am very keen to nut this one out! I will update when I have a breakthrough! Cheers - Frisky – Acanthoid 13/10, 2013 at 9:15

4 years later and I can finally answer this! This can be done using contains_points from matplotlib.path.

I've used a Gaussian smoothing from astropy which can be omitted or substituted as needed.

import matplotlib.colors as colors
from matplotlib import path
import numpy as np
from matplotlib import pyplot as plt
try:
    from astropy.convolution import Gaussian2DKernel, convolve
    astro_smooth = True
except ImportError as IE:
    astro_smooth = False

np.random.seed(123)
t = np.linspace(-1,1.2,2000)
x = (t**2)+(0.3*np.random.randn(2000))
y = (t**5)+(0.5*np.random.randn(2000))

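H, xedges, yedges = np.histogram2d(x,y, bins=(50,40))
# Use bin centres rather than left edges so the contours are not shifted by half a bin
xcentres = 0.5*(xedges[:-1]+xedges[1:])
ycentres = 0.5*(yedges[:-1]+yedges[1:])
xmesh, ymesh = np.meshgrid(xcentres, ycentres)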

# Smooth the contours (if astropy is installed)
if astro_smooth:
    kernel = Gaussian2DKernel(stddev=1.)
    H=convolve(H,kernel)

fig,ax = plt.subplots(1, figsize=(7,6)) 
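clevels = ax.contour(xmesh,ymesh,H.T,linewidths=.9,cmap='winter')#,zorder=90)

# Identify points within the lowest contour level
# (ContourSet.collections is deprecated since Matplotlib 3.8, hence the fallback)
if hasattr(clevels, 'collections'):
    p = clevels.collections[0].get_paths()
else:
    p = clevels.get_paths()[:1]
inside = np.zeros(x.shape, dtype=bool)
# contains_points needs an (N, 2) array; a bare zip() is only an iterator in Python 3
points = np.column_stack((x, y))
for level in p:
    inside |= level.contains_points(points)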

ax.plot(x[~inside],y[~inside],'kx')
plt.show(block=False)
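In the code above the threshold is implicit: points are dropped from the scatter if they fall inside the lowest contour level that Matplotlib picked (it should be available as clevels.levels[0]). A rough sketch of making the threshold explicit, without going through contour paths at all, is to look up the histogram count of the bin each point falls in. The helper name split_by_bin_count and the threshold value are made up for illustration, and this works on the raw (unsmoothed) counts.

```python
import numpy as np

def split_by_bin_count(x, y, bins=(50, 40), threshold=5):
    """Hypothetical helper: flag points whose histogram bin holds >= threshold points."""
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # Bin index of each point. Bins are half-open [left, right), except the last
    # one, so clip to put points sitting on the final edge into the last bin.
    ix = np.clip(np.searchsorted(xedges, x, side='right') - 1, 0, counts.shape[0] - 1)
    iy = np.clip(np.searchsorted(yedges, y, side='right') - 1, 0, counts.shape[1] - 1)
    dense = counts[ix, iy] >= threshold
    return dense, counts, xedges, yedges

# Same synthetic data as in the answer above.
np.random.seed(123)
t = np.linspace(-1, 1.2, 2000)
x = (t**2) + (0.3 * np.random.randn(2000))
y = (t**5) + (0.5 * np.random.randn(2000))

dense, counts, xedges, yedges = split_by_bin_count(x, y, threshold=5)
```

The points to scatter would then be x[~dense], y[~dense], and passing levels=[5, ...] to ax.contour keeps the lowest contour roughly consistent with the same cut.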

Acanthoid answered 1/8, 2017 at 10:22 Comment(0)

You can achieve this with a variety of numpy/scipy/matplotlib tools:

1. Create a scipy.spatial.KDTree of the original points for fast lookup.
2. Use np.meshgrid to create a grid of points at the resolution you want for the contour.
3. Use KDTree.query to create a mask of all locations that are within the target density.
4. Bin the data, either with rectangular bins or with plt.hexbin.
5. Plot the contour from the binned data, but use the mask from step 3 to filter out the lower-density regions.
6. Use the inverse of the mask to plt.scatter the remaining points.
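A minimal sketch of the KD-tree part of these steps, under some assumptions: it queries the tree at the data points themselves rather than on a meshgrid, and it calls a point "dense" when its k-th nearest neighbour lies within max_dist. The helper name dense_mask, the values of k and max_dist, and the test data are all invented for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def dense_mask(points, k=10, max_dist=0.1):
    """Hypothetical helper: True where the k-th nearest neighbour is within max_dist."""
    tree = cKDTree(points)
    # Ask for k + 1 neighbours because each point's nearest match
    # in its own tree is itself, at distance 0.
    dist, _ = tree.query(points, k=k + 1)
    return dist[:, -1] <= max_dist

# Made-up data: a tight clump on top of a sparse uniform background.
rng = np.random.default_rng(0)
clump = rng.normal(scale=0.05, size=(500, 2))
halo = rng.uniform(-5, 5, size=(200, 2))
points = np.vstack([clump, halo])

mask = dense_mask(points, k=10, max_dist=0.1)
```

The points in points[~mask] would go to plt.scatter, while the binned contour covers the region where the mask is True. The tree query avoids the all-pairs distance computation that makes a naive nearest-neighbour approach so slow on 300k points.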

Amniocentesis answered 29/10, 2013 at 19:58 Comment(1)

I haven't actually tried this directly, but this is essentially what I ended up doing. I resorted to using a hexbin 'heat plot' because I couldn't reduce the computation time of the contour stuff from order n^n -_- ... It might be worth going back and looking at it; it was a fun problem. – Acanthoid 27/7, 2014 at 23:5

Perhaps someone (like me) will stumble across this while searching the internet for an answer. @FriskyGrub, I like your smoothing approach. There is a solution within the AstroML library, with an example at https://www.astroml.org/book_figures/chapter1/fig_S82_scatter_contour.html#book-fig-chapter1-fig-s82-scatter-contour . I'm not sure how you set the threshold in your code (above which points are included in the contour rather than the scatter), but I managed to reproduce a result similar to yours with:

import numpy as np
import matplotlib.pyplot as plt
from astroML.plotting import scatter_contour
np.random.seed(123)
t = np.linspace(-1,1.2,2000)
x = (t**2)+(0.3*np.random.randn(2000))
y = (t**5)+(0.5*np.random.randn(2000))
fig,ax = plt.subplots(1,1,figsize=(6,6))
scatter_contour(x,y, threshold=15, log_counts=True, ax=ax,
            histogram2d_args=dict(bins=15),
            plot_args=dict(marker='+', linestyle='none', color='black',
                          markersize=5),
            contour_args=dict(cmap='winter',),
           filled_contour=False)

(In IPython, scatter_contour?? brings up a lot of docs with help, but basically, as the kwarg names suggest, histogram2d_args are the args taken by numpy.histogram2d, plot_args are the args taken by plt.plot for the scattered points, and contour_args are those taken by plt.contour (or plt.contourf).)

Best wishes

Chris

Daegal answered 25/10, 2019 at 19:8 Comment(0)
