Visualization of scatter plots with overlapping points in matplotlib

I have to represent about 30,000 points in a scatter plot in matplotlib. These points belong to two different classes, so I want to depict them with different colors.

I succeeded in doing so, but there is an issue. The points overlap in many regions, and the class I plot last is drawn on top of the other one, hiding it. Furthermore, a scatter plot cannot show how many points lie in each region. I have also tried to make a 2D histogram with histogram2d and imshow, but it's difficult to show the points belonging to both classes in a clear way.

Can you suggest a way to make clear both the distribution of the classes and the concentration of the points?

EDIT: To be more clear, this is the link to my data file in the format "x,y,class"

Wonderland answered 28/9, 2013 at 8:2 Comment(5)
Why not a histogram with two colors? Doesn't it look good enough? - Baize
@OfirIsrael I have tried to use histogram2d and imshow with alpha levels to have two overlapping histograms, but the result seems to be very poor - Wonderland
Have you tried showing the histograms using contour instead of alpha blending? matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.contour - Peskoff
do not add noise to your data, that is crossing the line into immoral data manipulation. - Masaccio
@tcaswell It is crossing the line into immoral data manipulation if and only if you hide it. - Epitasis

One approach is to plot the data as a scatter plot with a low alpha, so you can see the individual points as well as a rough measure of density. (The downside to this is that the approach has a limited range of overlap it can show -- i.e., a maximum density of about 1/alpha.)

Here's an example:


As you can imagine, because of the limited range of overlaps that can be expressed, there's a tradeoff between visibility of the individual points and the expression of amount of overlap (and the size of the marker, plot, etc).

import numpy as np
import matplotlib.pyplot as plt

N = 10000
mean = [0, 0]
cov = [[2, 1], [1, 2]]  # a covariance matrix must be symmetric
x,y = np.random.multivariate_normal(mean, cov, N).T

plt.scatter(x, y, s=70, alpha=0.03)
plt.ylim((-5, 5))
plt.xlim((-5, 5))
plt.show()
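As a rough check on the 1/alpha rule of thumb, the sketch below computes the combined opacity of n stacked markers. It assumes standard "over" alpha compositing, which is an assumption about how the renderer blends overlapping markers, and `stacked_opacity` is a hypothetical helper name, not part of matplotlib.

```python
def stacked_opacity(alpha, n):
    """Combined opacity of n overlapping markers, each drawn with the given alpha.

    Assumes standard "over" compositing: every new layer lets a fraction
    (1 - alpha) of whatever is underneath show through.
    """
    return 1.0 - (1.0 - alpha) ** n


if __name__ == "__main__":
    alpha = 0.03
    for n in (1, 10, 33, 100, 300):
        print(f"{n:4d} overlapping points -> opacity {stacked_opacity(alpha, n):.3f}")
```

Under that assumption, a stack of about 1/alpha points is roughly 63% opaque and a stack of about 3/alpha points is more than 95% opaque, so differences in density beyond a few multiples of 1/alpha are hard to see.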

(I'm assuming here you meant 30e3 points, not 30e6. For 30e6, I think some type of averaged density plot would be necessary.)

Indictment answered 28/9, 2013 at 15:22 Comment(0)

You could also colour the points by first computing a kernel density estimate of the distribution of the scatter, and then using the density values to specify a colour for each point. Modifying the code from the earlier example:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde as kde
from matplotlib.colors import Normalize
from matplotlib import cm

N = 10000
mean = [0,0]
cov = [[2,1],[1,2]]  # a covariance matrix must be symmetric

samples = np.random.multivariate_normal(mean,cov,N).T
densObj = kde( samples )

def makeColours( vals ):
    norm = Normalize( vmin=vals.min(), vmax=vals.max() )

    # Can put any colormap you like here.
    # to_rgba accepts the whole array, so no per-value loop is needed.
    return cm.ScalarMappable( norm=norm, cmap='jet' ).to_rgba( vals )

colours = makeColours( densObj.evaluate( samples ) )

plt.scatter( samples[0], samples[1], color=colours )
plt.show()

[Image: scatter plot with density information]

I learnt this trick a while ago when I noticed the documentation of the scatter function --

c : color or sequence of color, optional, default : 'b'

c can be a single color format string, or a sequence of color specifications of length N, or a sequence of N numbers to be mapped to colors using the cmap and norm specified via kwargs (see below). Note that c should not be a single numeric RGB or RGBA sequence because that is indistinguishable from an array of values to be colormapped. c can be a 2-D array in which the rows are RGB or RGBA, however, including the case of a single row to specify the same color for all points.

Hardbitten answered 18/5, 2016 at 12:10 Comment(4)
This is a stunning solution to one of the most often encountered problems in plotting large datasets. Excellent work! - Lavelle
Is there any way to add a colorbar to the above figure to indicate the density of each color? - Aperiodic
You can optimize this solution by simply using the cmap kwarg of the scatter method (i.e. plt.scatter(samples[0], samples[1], c=densObj.evaluate(samples), cmap="jet")) with no need for the extra function. - Macur
@ZackEriksen try plt.colorbar() - Macur
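
The two comments above can be combined into a shorter variant: pass the density values directly as c together with a cmap, then add a colorbar. This is a minimal sketch that reuses the setup from the answer. The fixed seed, the smaller N, the symmetric covariance, and the sort by density (so the densest points are drawn last) are assumptions added for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
N = 2000  # smaller than the answer's 10000 to keep the KDE evaluation quick
samples = rng.multivariate_normal([0, 0], [[2, 1], [1, 2]], N).T  # shape (2, N)

# Estimate the density at each sample point.
density = gaussian_kde(samples)(samples)

# Draw the densest points last so they are not hidden by sparse ones.
order = np.argsort(density)
x, y, density = samples[0, order], samples[1, order], density[order]

if __name__ == "__main__":
    fig, ax = plt.subplots()
    sc = ax.scatter(x, y, c=density, cmap="jet", s=10)
    fig.colorbar(sc, ax=ax, label="estimated density")
    plt.show()
```

Because scatter does the colour mapping itself here, the colorbar picks up the same normalisation and colormap without any extra bookkeeping.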

My answer may not perfectly address your question, but I also needed to plot overlapping points, and mine overlapped exactly. I therefore came up with this function to offset identical points.

import numpy as np

def dodge_points(points, component_index, offset):
    """Dodge every point by a multiplicative offset (multiplier is based on frequency of appearance)

    Args:
        points (array-like (2D)): Array containing the points
        component_index (int): Index / column on which the offset will be applied 
        offset (float): Offset amount. Effective offset for each point is `index of appearance` * offset

    Returns:
        array-like (2D): Dodged points
    """

    # Work on a float copy so the caller's data is not modified and
    # fractional offsets can be added to integer input
    points = np.array(points, dtype=float)

    # Extract unique points so we can map an offset for each
    uniques, inv, counts = np.unique(
        points, return_inverse=True, return_counts=True, axis=0
    )
    inv = np.ravel(inv)  # flatten in case the inverse is returned as 2D

    for i, num_identical in enumerate(counts):
        # Prepare dodge values: 0, offset, 2 * offset, ...
        dodge_values = offset * np.arange(num_identical)
        # Find where the dodge values must be applied, in order
        points_loc = np.where(inv == i)[0]
        # Apply the dodge values
        points[points_loc, component_index] += dodge_values

    return points

Here is an example of before and after.

Before:

[Image: before dodge]

After:

[Image: after dodge]

This method only works for EXACTLY overlapping points (or if you are willing to round points off in a way that np.unique finds matching points).
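
For points that are only nearly identical, the rounding idea could look like the sketch below. It groups points by a rounded copy of their coordinates and then offsets the original, unrounded coordinates. The helper name `dodge_near_duplicates` and the `decimals` parameter are hypothetical and are not part of the function above.

```python
import numpy as np


def dodge_near_duplicates(points, component_index, offset, decimals=1):
    """Offset points whose rounded coordinates coincide.

    Hypothetical helper: `decimals` controls how coarse the matching is.
    Returns a new float array; the input is left untouched.
    """
    points = np.array(points, dtype=float)  # float copy of the input
    # Group points by their rounded coordinates.
    _, inv, counts = np.unique(
        np.round(points, decimals), axis=0, return_inverse=True, return_counts=True
    )
    inv = np.ravel(inv)  # flatten in case the inverse is returned as 2D
    for group in np.flatnonzero(counts > 1):
        members = np.flatnonzero(inv == group)
        # The k-th member of a group is shifted by k * offset.
        points[members, component_index] += offset * np.arange(members.size)
    return points


pts = [[1.00, 2.00], [1.02, 2.01], [3.00, 4.00], [0.98, 1.99]]
dodged = dodge_near_duplicates(pts, component_index=1, offset=0.5, decimals=1)
print(dodged)
```

One caveat of this design: two points that sit close together but on opposite sides of a rounding boundary (for example 1.04 and 1.06 with decimals=1) fall into different groups and are not dodged.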

Manado answered 24/11, 2021 at 21:32 Comment(0)

Using transparency/alpha, as suggested in another answer, can be very helpful for cases where crowding is a problem. However, if you have multiple data classes, you can still have issues with the later-plotted classes obscuring the earlier ones, especially if some classes have more data points than others.

Another answer uses a heatmap to show density, which is great when you want to show the density for a single class, but not straightforward to adapt to the case where you have multiple overlapping classes and want all to be visible.

One approach I've sometimes found helpful in this situation is to randomize plot order, instead of plotting classes sequentially. This can be combined with transparency.

For instance, modifying the example given in tom10's answer:

import numpy as np
import matplotlib.pyplot as plt

N0 = 2000
x0 = np.random.normal(0,2,N0)
y0 = np.random.normal(0,0.2,N0) + 0.25/(x0**2+0.25)

plt.scatter(x0, y0, s=70, alpha=0.03,c="r")

N1 = 10000
mean = [0, 0]
cov = [[2, 1], [1, 2]]  # a covariance matrix must be symmetric
x1,y1 = np.random.multivariate_normal(mean, cov, N1).T

plt.scatter(x1, y1, s=70, alpha=0.03,c="b")
plt.ylim((-5, 5))
plt.xlim((-5, 5))
plt.show()

results in:

[Image: a scatter plot with a bivariate normally distributed cluster of blue points plotted over red points. The red points appear to be located approximately on the line y=0, but near the middle they are largely obscured by the blue points.]

Glancing at this plot, the red points appear to be distributed close to the line y = 0.

But if we randomize plot order:

x = np.concatenate((x0,x1))
y = np.concatenate((y0,y1))
cols = np.concatenate((np.tile("r",len(x0)),np.tile("b",len(x1))))

rng = np.random.default_rng()
neworder=rng.permutation(len(x))

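x_shuffled = x[neworder]
y_shuffled = y[neworder]
cols_shuffled = cols[neworder]

# Plot all points in one call so the two classes are interleaved
plt.scatter(x_shuffled, y_shuffled, s=70, alpha=0.03, c=cols_shuffled)
plt.ylim((-5, 5))
plt.xlim((-5, 5))
plt.show()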

we get this:

[Image: the same plot as above, but instead of plotting blue over red, they have been plotted in random order, making them more visible near the centre of the plot. In this plot it is easier to see that near (0,0), the red points deviate considerably from the line y = 0.]

It's now much easier to see that near x = 0, the red points deviate significantly from the y=0 relationship that we'd have guessed when we could only see the edges of that distribution.

We could achieve similar results by binning the points (e.g. on a hex or square grid) and then setting colour and alpha for each bin according to the class distributions and number of data points for each bin. But I'm fond of the random-order approach because it's lower-tech and reduces the number of methods I need to remember.
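
The binning alternative mentioned above could be sketched as follows, using a square grid built with np.histogram2d. The colour of each bin encodes the fraction of points from each class (red versus blue) and the alpha encodes the total count on a log scale. The helper name `class_density_image`, the bin count, the log scaling, the fixed seed, and the symmetric covariance are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt


def class_density_image(x0, y0, x1, y1, bins=60, extent=(-5, 5, -5, 5)):
    """Return an RGBA image of shape (bins, bins, 4) for two point classes.

    Hypothetical helper: red/blue mix = class fractions, alpha = log-scaled count.
    """
    xmin, xmax, ymin, ymax = extent
    rng_xy = [[xmin, xmax], [ymin, ymax]]
    h0, _, _ = np.histogram2d(x0, y0, bins=bins, range=rng_xy)
    h1, _, _ = np.histogram2d(x1, y1, bins=bins, range=rng_xy)
    total = h0 + h1

    # Fraction of class 0 in each bin (0 where the bin is empty).
    frac0 = np.divide(h0, total, out=np.zeros_like(total), where=total > 0)

    rgba = np.zeros(total.shape + (4,))
    rgba[..., 0] = frac0        # red channel: share of class 0
    rgba[..., 2] = 1.0 - frac0  # blue channel: share of class 1
    if total.max() > 0:
        rgba[..., 3] = np.log1p(total) / np.log1p(total.max())
    # histogram2d puts x along axis 0; imshow expects rows to be y.
    return rgba.transpose(1, 0, 2)


rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
x0 = rng.normal(0, 2, 2000)
y0 = rng.normal(0, 0.2, 2000) + 0.25 / (x0**2 + 0.25)
x1, y1 = rng.multivariate_normal([0, 0], [[2, 1], [1, 2]], 10000).T
image = class_density_image(x0, y0, x1, y1)

if __name__ == "__main__":
    plt.imshow(image, origin="lower", extent=(-5, 5, -5, 5), aspect="auto")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()
```

Empty bins get zero alpha, so their colour never shows. Purple bins indicate a mix of both classes, which the layered scatter plot cannot convey.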

Alpert answered 15/11, 2023 at 0:46 Comment(0)
