I am evaluating vaex for the interactive outlier selection use case described in the question "Large plot: ~20 million samples, gigabytes of data".
Basically, I have some individual points which are outliers, and I want to see them on a graph so that I can manually select them and then examine them further.
The problem is that individual points become invisible if the rest of the dataset is too large.
How can I make such individual points visible?
For example, if I generate a dataset with 1 billion points and one outlier at the center top:
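import h5py
import numpy
size = 1000000000
with h5py.File('1b.hdf5', 'w') as f:
    # x: 0..size-1, plus one extra point in the horizontal center
    x = numpy.arange(size + 1)
    x[size] = size // 2
    f.create_dataset('x', data=x, dtype='int64')
    # y: 2 * x, plus the extra point well above the line (the outlier)
    y = numpy.arange(size + 1) * 2
    y[size] = 3 * size // 2
    f.create_dataset('y', data=y, dtype='int64')
    # z: marks the outlier with -1 so it can be selected later
    z = numpy.arange(size + 1) * 4
    z[size] = -1
    f.create_dataset('z', data=z, dtype='int64')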
and then display it in a Jupyter notebook with:
import vaex
df = vaex.open('1b.hdf5')
df.plot_widget(df.x, df.y, backend='bqplot')
The resulting plot in Jupyter shows only the diagonal line, so I can't see the outlier, which should be at the center top.
I can, however, select it since I know where it is, and it does show up with methods that take selection=True. It is just not getting displayed.
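My guess at why this happens, as a scaled-down NumPy sketch: I am assuming that plot_widget renders a binned density image (a 2D histogram) rather than individual markers, and that the color scale is normalized to the fullest bin. The 256x256 grid, the reduced size, and the names below are my own choices for illustration, not anything taken from vaex.

```python
import numpy as np

# Scaled down from 1 billion so that this runs quickly (assumption: the
# effect is the same, only stronger, at full size).
size = 1_000_000
x = np.arange(size + 1, dtype=np.int64)
y = x * 2
x[size] = size // 2        # the single outlier, horizontally centered
y[size] = 3 * size // 2    # ... and well above the diagonal

# Bin all points into a grid, as a density plot would (grid size is a guess).
counts, xedges, yedges = np.histogram2d(x, y, bins=256)

# Locate the bin that contains the outlier.
outlier_ix = np.searchsorted(xedges, size // 2, side="right") - 1
outlier_iy = np.searchsorted(yedges, 3 * size // 2, side="right") - 1
outlier_count = counts[outlier_ix, outlier_iy]
max_count = counts.max()

# Brightness of the outlier's bin relative to the fullest bin,
# under a linear and under a logarithmic color scale.
linear_intensity = outlier_count / max_count
log_intensity = np.log1p(outlier_count) / np.log1p(max_count)

print("outlier bin count:", outlier_count)
print("fullest bin count:", max_count)
print("linear intensity: ", linear_intensity)
print("log intensity:    ", log_intensity)
```

If that assumption holds, the outlier's bin holds a single count while each bin on the diagonal holds thousands (millions at full size), so on a linear scale it is indistinguishable from an empty bin. A logarithmic scale raises its relative brightness, but it still shrinks as the dataset grows.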
There are some examples at https://vaex.readthedocs.io/en/latest/tutorial.html#Smaller-datasets-/-scatter-plot in which the points look quite visible, but when I tried adding the extra arguments c="red", alpha=0.5, s=4 to plot_widget, it did not work. Presumably this backend does not support them.
Maybe there is a way to configure bqplot to change its plotting style?
Tested on vaex 2.0.2.