Interactive large plot with ~20 million sample points and gigabytes of data

I've got a problem here: my RAM is not able to hold the data I want to plot. I do have sufficient HD space. Is there any solution to avoid that "shadowing" of my data set?

Concretely, I deal with digital signal processing and I have to use a high sample rate. My framework (GNU Radio) saves the values in binary to avoid using too much disk space, and I unpack them afterwards. Then I need to plot, and I need the plot to be zoomable and interactive. And that is an issue.

Is there any optimization potential here, or other software or a programming language (like R) that can handle larger data sets? Actually I want much more data in my plots. But I have no experience with other software. gnuplot fails with an approach similar to the following, and I don't know R (yet).

import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
import struct

"""
plots a cfile

cfile - IEEE single-precision (4-byte) floats, IQ pairs, binary
txt - index,in-phase,quadrature in plaintext

note: directly plotting with numpy results into shadowed functions
"""

# unpacking the cfile dataset
def unpack_set(input_filename, output_filename):
    index = 0   # index of the samples
    output_filename = open(output_filename, 'wb')

    with open(input_filename, "rb") as f:

        byte = f.read(4)    # read 1st column of the vector

        while byte != "":
            # stored bit values
            floati = struct.unpack('f', byte)   # write value of 1st column to a variable
            byte = f.read(4)                    # read 2nd column of the vector
            floatq = struct.unpack('f', byte)   # write value of 2nd column to a variable
            byte = f.read(4)                    # next row of the vector, read 1st column
            # delimiter format for matplotlib
            lines = ["%d," % index, format(floati), ",", format(floatq), "\n"]
            output_filename.writelines(lines)
            index = index + 1
    output_filename.close()
    return output_filename.name

# reformats output (precision configuration here)
def format(value):
    return "%.8f" % value            

# start
def main():

    # specify path
    unpacked_file = unpack_set("test01.cfile", "test01.txt")
    # pass file reference to matplotlib
    fname = str(unpacked_file)
    plt.plotfile(fname, cols=(0,1)) # index vs. in-phase

    # optional
    # plt.axes([0, 0.5, 0, 100000]) # for 100k samples
    plt.grid(True)
    plt.title("Signal-Diagram")
    plt.xlabel("Sample")
    plt.ylabel("In-Phase")

    plt.show();

if __name__ == "__main__":
    main()

Something like plt.swap_on_disk() could cache the stuff on my SSD ;)

Exodus answered 2/5, 2011 at 7:25 Comment(4)
what do you mean by "directly plotting with numpy results into shadowed functions"?Phlogistic
I don't understand how you get "Gigabytes" of data. 20 million x (3 x (4 bytes)) = 240MB, right? And @EOL is completely right -- converting all that perfectly good binary data into a text format is a complete waste of time and I/O, use numpy to access the binary directly.Unvoice
Weird, i don't see you storing floati and floatq anywhere? Or is the line starting with 'lines' meant to be inside the while loop?Misstep
Interesting that I can't edit it because of the 6 non-space character rule for edits. Sometimes, 4 spaces is all it needs to make code completely non-working... ;)Misstep

So your data isn't that big, and the fact that you're having trouble plotting it points to issues with the tools. Matplotlib has lots of options and the output is fine, but it's a huge memory hog and it fundamentally assumes your data is small. But there are other options out there.

So as an example, I generated a 20M data-point file 'bigdata.bin' using the following:

#!/usr/bin/env python
import numpy

npts=20000000
filename='bigdata.bin'

def main():
    data = (numpy.random.uniform(0,1,(npts,3))).astype(numpy.float32)
    data[:,2] = 0.1*data[:,2]+numpy.exp(-((data[:,1]-0.5)**2.)/(0.25**2))
    with open(filename, 'wb') as fd:
        data.tofile(fd)  # scipy.io.numpyio.fwrite in the original; that module is gone from modern SciPy, tofile writes the same raw bytes

if __name__ == "__main__":
    main()

This generates a file of size ~229MB, which isn't all that big; but you've expressed that you'd like to go to even larger files, so you'll hit memory limits eventually.

Let's concentrate on non-interactive plots first. The first thing to realize is that vector plots with glyphs at each point are going to be a disaster -- for each of the 20 M points, most of which are going to overlap anyway, rendering little crosses or circles generates huge files and takes tonnes of time. This, I think, is what is sinking matplotlib by default.

Gnuplot has no trouble dealing with this:

gnuplot> set term png
gnuplot> set output 'foo.png'
gnuplot> plot 'bigdata.bin' binary format="%3float32" using 2:3 with dots

[image: gnuplot output]

And even Matplotlib can be made to behave with some caution (choosing a raster back end, and using pixels to mark points):

#!/usr/bin/env python
import numpy
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

datatype=[('index',numpy.float32), ('floati',numpy.float32), 
        ('floatq',numpy.float32)]
filename='bigdata.bin'

def main():
    data = numpy.memmap(filename, datatype, 'r') 
    plt.plot(data['floati'],data['floatq'],'r,')
    plt.grid(True)
    plt.title("Signal-Diagram")
    plt.xlabel("Sample")
    plt.ylabel("In-Phase")
    plt.savefig('foo2.png')

if __name__ == "__main__":
    main()  

[image: matplotlib output]

Now, if you want interactive, you're going to have to bin the data to plot, and zoom in on the fly. I don't know of any python tools that will help you do this offhand.
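
One way to roll this by hand is to draw a 2D histogram of whatever falls inside the current axes limits and re-bin whenever the limits change. A rough sketch of that idea, reusing the memmapped bigdata.bin from above (the 512-bin resolution, the log scaling and the callback wiring are arbitrary choices, not a recommendation):

#!/usr/bin/env python
import numpy as np
import matplotlib.pyplot as plt

datatype = [('index', np.float32), ('floati', np.float32), ('floatq', np.float32)]
data = np.memmap('bigdata.bin', datatype, 'r')
x, y = data['floati'], data['floatq']

fig, ax = plt.subplots()

# initial full-range 2D histogram, drawn as a small image
counts, xedges, yedges = np.histogram2d(x, y, bins=512)
im = ax.imshow(np.log1p(counts.T), origin='lower', aspect='auto',
               extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])

def rebin(axes):
    # recompute the histogram for whatever is inside the new view
    (x0, x1), (y0, y1) = axes.get_xlim(), axes.get_ylim()
    mask = (x >= x0) & (x <= x1) & (y >= y0) & (y <= y1)
    counts, _, _ = np.histogram2d(x[mask], y[mask], bins=512,
                                  range=[[x0, x1], [y0, y1]])
    im.set_data(np.log1p(counts.T))
    im.set_extent([x0, x1, y0, y1])
    fig.canvas.draw_idle()

# fires on every zoom/pan (twice per zoom, once per axis -- fine for a sketch)
ax.callbacks.connect('xlim_changed', rebin)
ax.callbacks.connect('ylim_changed', rebin)
plt.show()

Each zoom still scans all 20 million points to rebuild the mask, so it is not instant, but the figure only ever holds a 512x512 image.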

On the other hand, plotting-big-data is a pretty common task, and there are tools that are up for the job. Paraview is my personal favourite, and VisIt is another one. They both are mainly for 3D data, but Paraview in particular does 2d as well, and is very interactive (and even has a Python scripting interface). The only trick will be to write the data into a file format that Paraview can easily read.
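
For the binary file generated above, probably the simplest such format is just CSV with a header row, which ParaView's CSV reader (and VisIt) can open directly; a quick, hypothetical conversion could look like:

#!/usr/bin/env python
import numpy as np

datatype = [('index', np.float32), ('floati', np.float32), ('floatq', np.float32)]
data = np.memmap('bigdata.bin', datatype, 'r')

# write a headered CSV that ParaView's CSV reader opens as a table
with open('bigdata.csv', 'w') as f:
    f.write('index,floati,floatq\n')
    np.savetxt(f, np.column_stack([data['index'], data['floati'], data['floatq']]),
               fmt='%.8g', delimiter=',')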

Unvoice answered 2/5, 2011 at 13:58 Comment(4)
Great post. +1 for VisIt and ParaView mentions - they are both useful and powerful visualisation programs, designed to handle (very!) large datasets. Note that VisIt also has a Python scripting interface and can draw 1D, in addition to 2D and 3D, plots (curves). In terms of a file format, VTK is a relatively straightforward format that both programs support (indeed ParaView is built on the VTK library).Puttergill
Check out Bokeh Datashader, which "turns even the largest data into images": github.com/bokeh/datashaderMulder
Thanks for mentioning Paraview and VisIt. Both managed to cover my 2D use case mentioned at: https://mcmap.net/q/203599/-interactive-large-plot-with-20-million-sample-points-and-gigabytes-of-dataConfirm
I tried to plot a column of my csv file with ~3 million rows using paraview. However, I am not able to understand how to use the zoom feature for the line plot. The plot was generated very fast but it seems that the plot is not really interactive. Can someone please help?Derogative

A survey of open source interactive plotting software with a 10 million point scatter plot benchmark on Ubuntu

Inspired by the use case described at: https://stats.stackexchange.com/questions/376361/how-to-find-the-sample-points-that-have-statistically-meaningful-large-outlier-r I have benchmarked a few plotting programs with the exact same input files.

Basically, I wanted to:

  • do an XY scatter plot of multidimensional data, hopefully with Z as the point color
  • interactively select some interesting looking points from the plot with my mouse
  • view all dimensions of the selected points (including at least X, Y and Z) to try and understand why they are outliers in the XY scatter

That problem can be represented by the following simplified test data:

python -c 'for i in range(10000000): print(f"{i},{i*2},{i*4}")' > 10m1.csv
echo 5000000,20000000,-1 >> 10m1.csv

The first few lines of 10m1.csv (~239 MB) look like this:


0,0,0
1,2,4
2,4,8
3,6,12
4,8,16

and the very last one, the 10-million-and-first, is the outlier, and looks like:

5000000,20000000,-1

so we basically have:

  • a line with slope 2 and 10 million points on it
  • plus a single outlier point outside of the line, on the top center of the plot

something like:

Y

^
|
|
|       +       +
|
|             +
|
|           +
|
|         +
|
|       +
|
|     +
|
|   +
|
| +
|
+-------------------> X

And the goal of this benchmark is to find the point (5000000,20000000) on the graphical plot, and then determine the value of the third column from it, which is -1 in our test.

When I first wrote this answer, I had used 10m.csv generated with:

python -c 'for i in range(10000000): print(f"{i},{i*2},{i*4}")' > 10m.csv

without the outlier. While this tests performance, it does not test selection capabilities, so the goal is to migrate each test to 10m1.csv when I find motivation to do it.

I also made a 10 point + outlier example in case I wanted to evaluate usability for some tool that could not handle the 10m point count:

i=0;
while [ "$i" -lt 10 ]; do
  echo "$i,$((2 * i)),$((4 * i))"; i=$((i + 1));
done > 11.csv
echo 5,20,-1 >> 11.csv

To have extra fun, I also prepared an even larger 1 billion point dataset in case any of the programs could handle the 10 million points! CSV files were getting a bit wonky, so I moved to HDF5:

#!/usr/bin/env python3

import h5py
import numpy

size = 1000000000

with h5py.File('1b.hdf5', 'w') as f:
    x = numpy.arange(size + 1)
    x[size] =  size / 2
    f.create_dataset('x', data=x, dtype='int64')
    y = numpy.arange(size + 1) * 2
    y[size] =  3 * size / 2
    f.create_dataset('y', data=y, dtype='int64')
    z = numpy.arange(size + 1) * 4
    z[size] = -1
    f.create_dataset('z', data=z, dtype='int64')

This produces a ~23 GiB file analogous to 10m1.csv containing:

  • 1 billion points in a straight line much like 10m.csv
  • one outlier point at the center top of the graph

I'm also creating a SQLite version of 10m1.csv, because that is perhaps one of the most reasonable formats to work with in practice, given that it will allow for well understood SQL querying, explicit indexing control and binary numeric data:

f=10m.sqlite
rm -f "$f"
n=10000000
time sqlite3 "$f" 'create table t(x integer, y integer, z integer)'
time sqlite3 "$f" "insert into t select value as x, value * 2 as y, value * 4 as z from generate_series(0, $((n - 1)))"
time sqlite3 "$f" "insert into t values ($((n/2)), $((2*n)), -1)"
time sqlite3 "$f" 'create index txy on t(x, y)'

I also ran that code with n = 1 billion to produce a 1b.sqlite. generate_series is the fastest insertion method I could find so far: Bulk insert huge data into SQLite using Python

I index by (x, y) as that is presumably what would speed up queries made by a viewer tool trying to get all points in a given x-y rectangle. The resulting 10m1.sqlite is about 367 MB, which is larger than the CSV due to the index.
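
To sanity check that assumption, EXPLAIN QUERY PLAN on a rectangle query should report a search on the txy index rather than a full scan of t; a tiny check (the exact wording of the output varies between SQLite versions):

import sqlite3

conn = sqlite3.connect('10m1.sqlite')
plan = '''explain query plan
          select * from t where x >= ? and x < ? and y >= ? and y < ?'''
for row in conn.execute(plan, (0, 1000, 0, 2000)):
    print(row)
# hopefully prints something like: SEARCH t USING INDEX txy (x>? AND x<?)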

The tests were carried out on Ubuntu 18.10 unless mentioned otherwise in a given subsection, on a ThinkPad P51 laptop with an Intel Core i7-7820HQ CPU (4 cores / 8 threads), 2x Samsung M471A2K43BB1-CRC RAM (2x 16GiB) and an NVIDIA Quadro M1200 4GB GDDR5 GPU.

Summary of results

This is what I observed, considering my very specific test use case and that I'm a first time user of many of the reviewed software:

Does it handle 10 million points:

Tool          | Handles 10m points?             | Lots of features?                                                           | UI feels good?
Vaex          | Yes, even 1 billion!            | Yes                                                                         | Yes, Jupyter widget
VisIt         | Yes, but not 100m               | Yes, 2D and 3D, focus on interactive                                        | No
Paraview      | No                              | Same as above, maybe a bit fewer 2D features                                | Very
Mayavi        | Yes                             | 3D only, good interactive and scripting support, but more limited features  | OK
gnuplot       | Barely, in non-interactive mode | Lots of features, but limited in interactive mode                           | OK
matplotlib    | No                              | Same as above                                                               | OK
Bokeh         | No, only up to 1m               | Yes, easy to script                                                         | Very, Jupyter widget
PyViz         | ?                               | ?                                                                           | ?
seaborn       | ?                               | ?                                                                           | ?
sqlitebrowser | No                              | Can visualize SQL query results                                             | Meh

Vaex 2.0.2

https://github.com/vaexio/vaex

Install and get a hello world working as shown at: How to do interactive 2D scatter plot zoom / point selection in Vaex?

I tested vaex with up to 1 billion points and it worked, it is awesome!

It is "Python-scripted-first" which is great for reproducibility, and allows me to easily interface with other Python things.

The Jupyter setup has a few moving parts, but once I got it running with virtualenv, it was amazing.

To load our CSV run in Jupyter:

import vaex
df = vaex.from_csv('10m.csv', names=['x', 'y', 'z'],)
df.plot_widget(df.x, df.y, backend='bqplot')

and we can see instantly:

[screenshot: initial Vaex plot of the 10m points]

Now, we can zoom, pan and select points with the mouse, and updates are really fast, all in under 10 seconds. Here I have zoomed in to see some individual points and have selected a few of them (faint lighter rectangle on image):

[screenshot: zoomed-in view with a few points selected]

After the selection is made with the mouse, this has the exact same effect as using the df.select() method. So we can extract the selected points by running in Jupyter:

df.to_pandas_df(selection=True)

which outputs data with format:

        x       y        z   index
0 4525460 9050920 18101840 4525460
1 4525461 9050922 18101844 4525461
2 4525462 9050924 18101848 4525462
3 4525463 9050926 18101852 4525463
4 4525464 9050928 18101856 4525464
5 4525465 9050930 18101860 4525465
6 4525466 9050932 18101864 4525466

Since 10M points worked fine, I decided to try 1B points... and it also worked fine!

import vaex
df = vaex.open('1b.hdf5')
df.plot_widget(df.x, df.y, backend='bqplot')

To observe the outlier, which was invisible on the original plot, we can follow How change the point style in a vaex interactive Jupyter bqplot plot_widget to make individual points larger and visible? and use:

df.plot_widget(df.x, df.y, f='log', shape=128, backend='bqplot')

which produces:

[screenshot: 1B-point plot with the outlier bin made visible]

and after selecting the point:

[screenshot: the outlier selected]

we obtain the outlier's full data:

   x          y           z
0  500000000  1500000000  -1

Here is a demo by the creators with a more interesting dataset and more features: https://www.youtube.com/watch?v=2Tt0i823-ec&t=770

There is no built-in sqlite support however unfortunately: https://github.com/vaexio/vaex/issues/864

Tested in Ubuntu 19.04.

VisIt 2.13.3

Website: https://wci.llnl.gov/simulation/computer-codes/visit

License: BSD

Developed by Lawrence Livermore National Laboratory, which is a National Nuclear Security Administration laboratory, so you can imagine that 10m points will be nothing for it if I could get it working. (The book The Supermen: The Story of Seymour Cray by Charles J. Murray (1997) does a good job of showing how computationally power-hungry labs such as these were when building the first H-bombs, because you can't just run experiments at will with nukes, and even if you do, you can't really measure what you would like because it blows up too fast and too hot: a computer model is a must. And they decided that a bunch of physicists' wives with calculators wasn't going to cut it as it had for the earlier Los Alamos fission bomb. When Israel bought one of their computers, everyone immediately assumed it was to make nukes.)

Installation: there is no Debian package; just download the Linux binaries from the website. It runs without installing. See also: https://askubuntu.com/questions/966901/installing-visit

Based on VTK, which is the backend library that much high-performance graphing software uses. Written in C++.

After 3 hours of playing with the UI, I did get it working, and it did solve my use case as detailed at: https://stats.stackexchange.com/questions/376361/how-to-find-the-sample-points-that-have-statistically-meaningful-large-outlier-r

Here is how it looks on the test data of this post:

[screenshot: VisIt plot of the test data]

and a zoom with some picks:

[screenshot: zoomed view with some picks]

and here is the picks window:

[screenshot: the picks window]

Performance-wise, VisIt was very good: every graphic operation either took only a small amount of time or was immediate. When I had to wait, it showed a "processing" message with the percentage of work left, and the GUI didn't freeze.

Since 10m points worked so well, I also tried 100m points (a 2.7G CSV file), but unfortunately it crashed / went into a weird state: I watched in htop as the 4 VisIt threads took up all of my 16 GiB of RAM and likely died due to a failed malloc.

The initial getting started was a bit painful:

  • many of the defaults feel atrocious if you are not a nuclear bomb engineer
  • there are just a lot of features, so it can be hard to find what you want
  • the manual was very helpful, but it used to be a 386-page PDF mammoth ominously dated "October 2005 Version 1.5" (I wonder if they used this to develop Trinity!); it is now a nice Sphinx HTML version, created just after I originally answered this question
  • no Ubuntu package. But the prebuilt binaries did just work.

I attribute these problems to:

  • it has been around for such a long time and uses some outdated GUI ideas
  • you can't just click on the plot elements to change them (e.g. axes, title, etc.), and there are a lot of features, so it is a bit hard to find the one you are looking for

I also love how a bit of LLNL infrastructure leaks into that repo. See for example docs/OfficeHours.txt and other files in that directory! I'm sorry for Brad who is the "Monday Morning guy"! Oh, and the password for the answering machine is "Kill Ed", don't forget that.

Paraview 5.9.0

Website: https://www.paraview.org/

License: BSD

Tested on: Ubuntu 20.10.

Installation:

sudo apt install paraview

or get the latest by downloading prebuilt binaries from the website. This is what I did for this review, since the apt one was only at 5.7.0. I downloaded ParaView-5.9.0-MPI-Linux-Python3.8-64bit.tar.gz.

Developed by Kitware and Los Alamos National Laboratory, and later Sandia National Laboratories (so the other two NNSA labs), so once again we expect that it will easily handle the data. Also VTK based and written in C++, which was further promising.

However I was disappointed: for some reason, 10m points made the GUI very slow and unresponsive, making it unusable. Whenever I clicked something, like to hide the lines, it took several dozen seconds. I think that at some point it just glitched out and stopped responding at all.

I'm fine with a controlled well advertised "I'm working now, wait a bit" moment, but the GUI freezing while that happens? Not acceptable.

htop showed that Paraview was using 8 threads and 3GB RAM, so neither CPU nor memory was maxed out.

GUI-wise, Paraview is very nice and modern, way better than VisIt when it is not stuttering.

Since 10m1.csv killed it, I tested with 11.csv to see if I would have been able to solve my problem except for performance, and the answer is yes:

  • paraview 11.csv
  • select CSV reader from the popup
  • Properties > Apply on the left
  • right click on the CSV on the Pipeline Browser
  • Add filter > Alphabetical > Plot data. Why is plotting a filter? Not very intuitive for first-time users; related: paraview: plot data from csv file. I'm sure it is one of those things that make sense once you understand further generalizations of what filters can do, but still.
  • Properties > Apply
  • unselect "Use index for x axis"
  • X Array Name: Field 0
  • Series Parameters remove Field 0 and Field 2
  • Select Field 1 and:
    • Line style: None
    • Marker style: cross
    • Marker size: increase or decrease as needed
  • "Rectangle Selection (s)" icon above the plot
  • Select outlier (point is highlighted)
  • Add another filter to the plot filter: "Extract Selection"
  • Apply

And finally!!! I get a table containing only the selected outlier, and showing the value of "Field 2" as -1:

[screenshot: spreadsheet view showing only the selected outlier, with "Field 2" = -1]

So yes, not exactly a walk in the park, but I managed eventually.

Another downside is that Paraview felt lacking in features compared to VisIt.

Mayavi 4.6.2

Website: https://github.com/enthought/mayavi

Developed by: Enthought

Install:

sudo apt-get install libvtk6-dev
python3 -m pip install -U mayavi PyQt5

The VTK Python one.

Mayavi seems to be very focused on 3D, I could not find how to do 2D plots in it, so it does not cut it for my use case unfortunately.

Just to check performance however, I adapted the example from: https://docs.enthought.com/mayavi/mayavi/auto/example_scatter_plot.html for 10 million points, and it ran just fine without lagging:

import numpy as np
from tvtk.api import tvtk
from mayavi.scripts import mayavi2

n = 10000000
pd = tvtk.PolyData()
pd.points = np.linspace((1,1,1),(n,n,n),n)
pd.verts = np.arange(n).reshape((-1, 1))
pd.point_data.scalars = np.arange(n)

@mayavi2.standalone
def main():
   from mayavi.sources.vtk_data_source import VTKDataSource
   from mayavi.modules.outline import Outline
   from mayavi.modules.surface import Surface
   mayavi.new_scene()
   d = VTKDataSource()
   d.data = pd
   mayavi.add_source(d)
   mayavi.add_module(Outline())
   s = Surface()
   mayavi.add_module(s)
   s.actor.property.trait_set(representation='p', point_size=1)
main()

Output:

[screenshot: Mayavi 10m-point scatter]

I couldn't however zoom in enough to see individual points; the near 3D plane was too far. Maybe there is a way?

One cool thing about Mayavi is that the devs put a lot of effort into allowing you to fire up and set up the GUI from a Python script nicely, much like Matplotlib and gnuplot. It seems that this is also possible in Paraview, but the docs are not as good, at least.

Generally it feels not as featureful as VisIt / Paraview. For example, I couldn't directly load a CSV from the GUI: How to load a CSV file from the Mayavi GUI?

Gnuplot 5.2.2

Website: http://www.gnuplot.info/

gnuplot is really convenient when I need to go quick and dirty, and it is always the first thing that I try.

Installation:

sudo apt-get install gnuplot

For non-interactive use, it can handle 10m points reasonably well:

#!/usr/bin/env gnuplot
set terminal png size 1024,1024
set output "gnuplot.png"
set key off
set datafile separator ","
plot "10m1.csv" using 1:2:3:3 with labels point

which finished in 7 seconds:

[image: gnuplot PNG output of 10m1.csv]

But if I try to go interactive with

#!/usr/bin/env gnuplot
set terminal wxt size 1024,1024
set key off
set datafile separator ","
plot "10m.csv" using 1:2:3 palette

and:

gnuplot -persist main.gnuplot

then the initial render and zooms feel too sluggish. I can't even see the rectangle selection line!

Also note that for my use case, I needed to use hypertext labels as in:

plot "10m.csv" using 1:2:3 with labels hypertext

but there was a performance bug with the labels feature including for non-interactive rendering. But I reported it, and Ethan solved it in a day: https://groups.google.com/forum/#!topic/comp.graphics.apps.gnuplot/qpL8aJIi9ZE

I must say however that there is one reasonable workaround for outlier selection: just add labels with the row ID to all points! If there are many points nearby, you won't be able to read the labels. But for the outliers which you care about, you just might! For example, if I add one outlier to our original data:

cp 10m.csv 10m1.csv
printf '2500000,10000000,40000000\n' >> 10m1.csv

and modify the plot command to:

#!/usr/bin/env gnuplot
set terminal png size 1024,1024
set output "gnuplot.png"
set key off
set datafile separator ","
plot "10.csv" using 1:2:3:3 palette with labels

This slowed down the plotting significantly (40 mins after the fix mentioned above!!!), but produces a reasonable output:

[image: gnuplot output with row ID labels on every point]

so with some data filtering, we would get there, eventually.

Matplotlib 1.5.1, numpy 1.11.1, Python 3.6.7

Website: https://matplotlib.org/

Matplotlib is what I usually try when my gnuplot script starts getting too insane.

numpy.loadtxt alone took about 10 seconds, so I knew this wasn't going to go well:

#!/usr/bin/env python3

import numpy
import matplotlib.pyplot as plt

x, y, z = numpy.loadtxt('10m.csv', delimiter=',', unpack=True)
plt.figure(figsize=(8, 8), dpi=128)
plt.scatter(x, y, c=z)
# Non-interactive.
#plt.savefig('matplotlib.png')
# Interactive.
plt.show()

First the non-interactive attempt gave good output, but took 3 minutes and 55 seconds...

Then the interactive one took a long time on initial render and on zooms. Not usable:

[screenshot: matplotlib interactive window]

Notice on this screenshot how the zoom selection, which should immediately zoom and disappear, stayed on screen for a long time while it waited for the zoom to be calculated!

I had to comment out plt.figure(figsize=(8, 8), dpi=128) for the interactive version to work for some reason, or else it blew up with:

RuntimeError: In set_size: Could not set the fontsize

Bokeh 1.3.1

https://github.com/bokeh/bokeh

Ubuntu 19.04 install:

python3 -m pip install bokeh

Then launch Jupyter:

jupyter notebook

Now if I plot 1m points, everything works perfectly, the interface is awesome and fast, including zoom and on hover information:

from bokeh.io import output_notebook, show
from bokeh.models import HoverTool
from bokeh.transform import linear_cmap
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
import numpy as np

output_notebook()
N = 1000000
source = ColumnDataSource(data=dict(
    x=np.random.random(size=N) * N,
    y=np.random.random(size=N) * N,
    z=np.random.random(size=N)
))
hover = HoverTool(tooltips=[("z", "@z")])
p = figure()
p.add_tools(hover)
p.circle(
    'x',
    'y',
    source=source,
    color=linear_cmap('z', 'Viridis256', 0, 1.0),
    size=5
)
show(p)

Initial view:

[screenshot: Bokeh initial view]

After a zoom:

[screenshot: Bokeh after a zoom]

If I go up to 10m however, it chokes: htop shows that Chromium has 8 threads taking up all my memory in uninterruptible I/O state.

This question asks about how to reference the selected points: How to reference selected bokeh data points

PyViz

https://pyviz.org/

TODO evaluate.

Integrates Bokeh + datashader + other tools.

Video demoing 1B datapoints: https://www.youtube.com/watch?v=k27MJJLJNT4 "PyViz: Dashboards for Visualizing 1 Billion Datapoints in 30 Lines of Python" by "Anaconda, Inc." published on 2018-04-17.

seaborn

https://seaborn.pydata.org/

TODO evaluate.

There's already a Q&A on how to use seaborn to visualize at least 50 million rows.

sqlitebrowser 3.12.2

https://github.com/sqlitebrowser/sqlitebrowser

I tried this one to see if it could handle 10m1.sqlite but unfortunately it couldn't. Shame!

It is quite cool that it can directly plot query results though.

Here's what it looks like:

[screenshot: sqlitebrowser plotting the browsed subset of 10m1.sqlite]

In this image, I loaded 10m1.sqlite into the tool, and then started browsing the data.

But it only plots the data that was loaded for browsing.

You can click the button on the bottom right under the plot to "Load all data and redraw plot", but that opens a progress bar that goes up 1% every 3 s, so it was not looking promising and I gave up.

Tested on Ubuntu 23.04.

SQL histogram queries

I wonder why I can't easily find an interactive UI tool that uses this as a backend. SQL histograms on an indexed database feel like the most rational way to go about things. E.g. using 10 steps and ignoring empty bins:

div=10
x=0
y=0
x2=10000000
y2=20000000

dx=$(((x2 - x) / div))
dy=$(((y2 - y) / div))
time sqlite3 10m1.sqlite --cmd '.mode csv' <<EOF
select
  floor(x/$dx)*$dx as x,
  floor(y/$dy)*$dy as y,
  count(*) as cnt
from t
where
  x >= $x and x < $x2 and
  y >= $y and y < $y2
group by 1, 2
order by 1, 2
EOF

we reach:

0,0,1000000
1000000,2000000,1000000
2000000,4000000,1000000
3000000,6000000,1000000
4000000,8000000,1000000
5000000,10000000,1000000
5000000,20000000,1
6000000,12000000,1000000
7000000,14000000,1000000
8000000,16000000,1000000
9000000,18000000,1000000

and the query takes 6 seconds, so it can handle 10m points OK-ish, but won't scale to 1B.

Since we already have just 1 point, we can then just do a full listing in that range:

x=5000000
y=20000000
x2=6000000
y2=40000000
time sqlite3 10m1.sqlite --cmd '.mode csv' <<EOF
select *
from t
where
  x >= $x and x < $x2 and
  y >= $y and y < $y2
order by x, y
EOF

which instantly gives the final desired result:

5000000,20000000,-1

So a GUI for this could have a maximum limit of points, where (see the sketch after this list):

  • if the limit is crossed, use a heatmap
  • otherwise, query full individual points in that bin and plot individual points in the plot
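
A bare-bones sketch of that decision logic against the t(x, y, z) schema and 10m1.sqlite from above; the point limit, bin count and function name are made up for illustration:

#!/usr/bin/env python3
import sqlite3

MAX_POINTS = 10000   # above this, fall back to a binned heatmap
BINS = 100

def query_viewport(conn, x0, x1, y0, y1):
    where = 'x >= ? and x < ? and y >= ? and y < ?'
    box = (x0, x1, y0, y1)
    n = conn.execute('select count(*) from t where ' + where, box).fetchone()[0]
    if n <= MAX_POINTS:
        # few enough points: plot them individually
        return 'points', conn.execute(
            'select x, y, z from t where ' + where, box).fetchall()
    # too many: return per-bin counts for a heatmap instead
    dx, dy = (x1 - x0) / BINS, (y1 - y0) / BINS
    return 'heatmap', conn.execute(
        'select cast((x - ?) / ? as integer) as xb,'
        '       cast((y - ?) / ? as integer) as yb,'
        '       count(*)'
        ' from t where ' + where + ' group by xb, yb',
        (x0, dx, y0, dy) + box).fetchall()

conn = sqlite3.connect('10m1.sqlite')
kind, rows = query_viewport(conn, 4000000, 6000000, 15000000, 25000000)
print(kind, len(rows), rows[:3])   # only the outlier falls in this window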

How to scale SQL to 1B rows: R-tree indices

To scale to 1B, we need r-tree/spatial indices, which allow us to do inequalities on multiple columns efficiently. SQLite has them, but they are a bit annoying to use.
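
To give a flavour of the annoyance, here is a minimal sketch, assuming an SQLite build with the rtree module enabled (table and file names are illustrative, not the schema of the 100m test below): points have to be stored as degenerate bounding boxes, the default rtree keeps its coordinates as 32-bit floats, and a payload column such as z needs a separate rowid-matched table.

#!/usr/bin/env python3
import sqlite3

conn = sqlite3.connect('rtree-demo.sqlite')
# the r-tree indexes bounding boxes, so each point becomes a zero-area box
conn.execute('create virtual table pts_idx using rtree(id, x0, x1, y0, y1)')
conn.execute('create table pts(id integer primary key, z integer)')
with conn:
    conn.executemany('insert into pts_idx values (?, ?, ?, ?, ?)',
                     ((i, i, i, 2 * i, 2 * i) for i in range(100000)))
    conn.executemany('insert into pts values (?, ?)',
                     ((i, 4 * i) for i in range(100000)))

# all points inside an x/y rectangle via the r-tree, joined back for z
print(conn.execute('''
    select pts_idx.id, x0, y0, z
    from pts_idx join pts on pts.id = pts_idx.id
    where x0 >= 1000 and x1 < 1010 and y0 >= 0 and y1 < 100000
''').fetchall())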

Despite these limitations, I ended up doing a 100m point test regardless, with duplicated x/y columns. Creation time: 30 mins; file size: 5.9 GB.

Then a capped count sweep:

max=100
div=10
x=0
y=0
x2=100000000
y2=200000000
dx=$(((x2 - x) / div))
dy=$(((y2 - y) / div))

cx=0
while [ $cx -lt $x2 ]; do
  cy=0
  while [ $cy -lt $y2 ]; do
    printf "$cx,$cy,"
    sqlite3 100mr.sqlite --cmd '.mode csv' <<EOF
select count(x) from (
  select x from t
  where
    x >= $cx and x < $((cx + dx)) and
    y >= $cy and y < $((cy + dy))
  limit $max
)
EOF
    cy=$((cy+dy))
  done
  cx=$((cx+dx))
done

finished in just 0.2 s, which is amazing. It would likely scale up to 1B if it weren't for the insanely long generation time.

PostgreSQL index creation wasn't faster either unfortunately: How to port simple spatial index using SQLite R-Trees to Postgres? though at least it supports points and not just rectangles.

Tested on Ubuntu 23.04.

Confirm answered 3/5, 2019 at 9:57 Comment(0)

A more recent project has strong potential for large data sets: Bokeh, which was created with exactly this in mind.

In fact, only the data that's relevant at the scale of the plot is sent to the display backend. This approach is much faster than the Matplotlib approach.
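
The comments point at Datashader for this same idea pushed further. A rough sketch, assuming the datashader and pandas packages and using made-up random data, that rasterizes 20 million points into a fixed-size image before anything is handed to the display:

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

n = 20000000
df = pd.DataFrame({'floati': np.random.randn(n), 'floatq': np.random.randn(n)})

canvas = ds.Canvas(plot_width=800, plot_height=800)   # output image size
agg = canvas.points(df, 'floati', 'floatq')           # per-pixel point counts
img = tf.shade(agg, how='log')                        # colourize the counts
img.to_pil().save('datashader.png')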

Clubhouse answered 15/3, 2014 at 3:51 Comment(1)
You might see go.continuum.io/visualizing-billions-data-points This is about Datashader datashader.readthedocs.io (and Bokeh!)Yocum

You can certainly optimize the reading of your file: you could directly read it into a NumPy array, so as to leverage the raw speed of NumPy. You have a few options. If RAM is an issue, you can use memmap, which keeps most of the file on disk (instead of in RAM):

# Each data point is a sequence of three 32-bit floats:
data = np.memmap(filename, mode='r', dtype=[('index', 'float32'), ('floati','float32'), ('floatq', 'float32')])

If RAM is not an issue, you can put the whole array in RAM with fromfile:

data = np.fromfile(filename, dtype=[('index', 'float32'), ('floati','float32'), ('floatq', 'float32')])

Plotting can then be done with Matplotlib's usual plot() function on the columns (for instance plot(data['floati'], data['floatq'])), possibly through the "zoom in" method proposed in another solution.

Clubhouse answered 2/5, 2011 at 8:43 Comment(3)
If you'd like to create a structured numpy array and open a file as readonly then: np.memmap(filename, mode='r', dtype=[('floati','f'), ('floatq', 'f')]).Phlogistic
starting from 1 million points of doubles, I get Agg overflows in any backend I tried, also with path.simplify=True. So I don't believe that it will be possible to do this 'just like that' with Matplotlib.Misstep
Interesting. As Jonathan Dursi's answer mentions, 20 million points is achievable with Matplotlib, but with some constraints (raster output,…).Clubhouse

I would suggest something a bit complex, but that should work: build your graph at different resolutions, for different ranges.

Think of Google Earth, for example. If you unzoom at maximum level to cover the whole planet, the resolution is the lowest. When you zoom, the pictures change for more detailed ones, but just on the region you're zooming on.

So basically for your plot (is it 2D? 3D? I'll assume it's 2D), I suggest you build one big graph that covers the whole [0, n] range with low resolution, 2 smaller graphs that cover [0, n/2] and [n/2 + 1, n] with twice the resolution of the big one, 4 smaller graphs that cover [0, n/4] ... [3 * n / 4 + 1, n] with twice the resolution of the 2 above, and so on.
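
A crude 1-D version of that pyramid, just to make the idea concrete (the decimation rule, block sizes and pixel budget below are arbitrary choices, not something any particular library provides):

import numpy as np

def build_pyramid(y, min_len=4096):
    # successive 2x decimations, keeping the larger-magnitude sample of each pair
    levels = [y]
    while len(levels[-1]) > min_len:
        cur = levels[-1]
        pairs = cur[:len(cur) // 2 * 2].reshape(-1, 2)
        keep_first = np.abs(pairs).argmax(axis=1) == 0
        levels.append(np.where(keep_first, pairs[:, 0], pairs[:, 1]))
    return levels

def view(levels, start, stop, pixels=1024):
    # pick the coarsest level that still gives roughly one sample per screen pixel
    level = 0
    while (stop - start) // 2 ** (level + 1) > pixels and level + 1 < len(levels):
        level += 1
    scale = 2 ** level
    return levels[level][start // scale:stop // scale]

y = np.random.randn(20000000).astype(np.float32)
levels = build_pyramid(y)
print(len(view(levels, 0, len(y))))      # zoomed out: a few thousand samples
print(len(view(levels, 1000, 3000)))     # zoomed in: the 2000 raw samples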

Not sure my explanation is really clear. Also, I don't know if this kind of multi-resolution graph is handled by any existing plot program.

Tobiastobie answered 2/5, 2011 at 7:54 Comment(6)
in that case I could simply reduce my sample rate... but I need that many values in one interactive plot.Exodus
@Exodus : Do you really need millions of values in one plot, considering many of those will overlap? Adaptive resolution seems reasonable, to me.Tobiastobie
no chance... specifically I'm recording real-time data encapsulated into a very short signal burst. That means all samples or nothing. You can't adapt on something you plan to analyze ;)Exodus
It's just a matter of display, not data analysis. Your results will not be affected, just the display will.Tobiastobie
I have a similar issue, but I have terabytes of data over a few years (sensor data with a sample rate of 40 kHz). Now I would like an interactive plot that shows something like a mean value for x hours or even a whole day, and when I zoom in it should dynamically calculate the new mean values, and so on, up to the point where I am zoomed in so far that no mean calculation is needed any more. The question is: is there something like this already realised, or do I have to program it myself?Remscheid
To my comment before: I know that I have to program at least something, but not everything. What I am trying / testing right now is PyViz with some of the integrated tools.Remscheid

I wonder if there's a win to be had by speeding up lookup of your points? (I've been intrigued by R* (r star) trees for a while.)

I wonder if using something like an r* tree in this case could be the way to go (when zoomed out, higher-up nodes in the tree could contain information about the coarser, zoomed-out rendering; nodes further towards the leaves contain the individual samples).

Maybe even memory-map the tree (or whatever structure you end up using) to keep your performance up and your RAM usage low (you offload the task of memory management to the kernel).
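
For what it's worth, a rough numpy sketch of the memory-mapping part of that idea, treating the samples as one flat float32 array (the coarse-level file name and the 1024-sample block size are made up):

import numpy as np

raw = np.memmap('bigdata.bin', dtype=np.float32, mode='r')
n_blocks = len(raw) // 1024
# one coarse level on disk: the max of every 1024-sample block
coarse = np.memmap('coarse.bin', dtype=np.float32, mode='w+', shape=(n_blocks,))
chunk = 1 << 16                           # blocks per pass, keeps RAM bounded
for i in range(0, n_blocks, chunk):
    j = min(i + chunk, n_blocks)
    coarse[i:j] = raw[i * 1024:j * 1024].reshape(-1, 1024).max(axis=1)
coarse.flush()
# a viewer can then memory-map coarse.bin read-only and let the kernel page in
# only the parts the current zoom window touches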

hope that makes sense.. rambling a bit. it's late!

Sporogonium answered 2/5, 2011 at 8:28 Comment(2)
I have no idea about R. My data-set resembles a csv like {index, floati, floatq}. That repeats 20M times. I'm not sure how you'd build the data-structure you mention in R.Exodus
I think it might be an involved project. I am talking about "R star" trees. wikipedia: en.wikipedia.org/wiki/R*_tree HTHSporogonium
