I am working on a project in which I want to perform data acquisition, data processing, and GUI visualization (using PyQt with pyqtgraph), all in Python. Each part is implemented in principle, but the parts are not well separated, which makes it difficult to benchmark and improve performance. So the question is:

Is there a good way to pass large amounts of data between the different parts of a program?

I think of something like the following scenario:

Acquisition: Get data from some device(s) and store it in a data container that can be accessed from elsewhere. (This part should be able to run without the processing and visualization parts. It is time critical, as I don't want to lose data points!)
Processing: Take data from the data container, process it, and store the results in another data container. (This part should also be able to run without the GUI, and with a delay after the acquisition, e.g. to process data that I recorded last week.)
GUI/visualization: Take acquired and processed data from the containers and visualize it.
Saving data: I want to be able to store/stream certain parts of the data to disk.

When I say "large amounts of data", I mean that I get arrays with approximately 2 million data points (16-bit) per second that need to be processed and possibly also stored.

Is there any framework for Python that I can use to handle this amount of data properly? Maybe in the form of a data server that I can connect to.

How much data?

In other words, are you acquiring so much data that you cannot keep all of it in memory while you need it?

For example, some measurements generate so much data that the only way to process them is after the fact:

Acquire the data to storage (usually RAID0)
Post-process the data
Analyze the results
Select and archive subsets
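
To judge which case applies, a rough back-of-the-envelope estimate helps. The sketch below uses only the figures given in the question (about 2 million 16-bit points per second); the variable names are just for illustration.

```python
# Figures taken from the question; names are illustrative.
SAMPLES_PER_SECOND = 2_000_000   # ~2 million data points per second
BYTES_PER_SAMPLE = 2             # 16-bit samples

bytes_per_second = SAMPLES_PER_SECOND * BYTES_PER_SAMPLE
mb_per_second = bytes_per_second / 1e6
gb_per_hour = bytes_per_second * 3600 / 1e9

print(f"{mb_per_second:.1f} MB/s, {gb_per_hour:.1f} GB/hour")
# prints "4.0 MB/s, 14.4 GB/hour"
```

Whether that counts as "small" depends on how long you record and how much memory the processing step needs on top of the raw data.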

Small Data

If your computer system is able to keep pace with the generation of data, you can use a separate Python queue between each stage.
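
A minimal sketch of what "a queue between each stage" could look like, using the standard library's `queue.Queue` and threads. The `acquire` and `process` functions, the fake data, and the `None` end-of-stream sentinel are all assumptions made for illustration, not part of your code; a real acquisition stage would read from the device instead.

```python
import queue
import threading

SENTINEL = None  # hypothetical convention: marks the end of the stream


def acquire(out_q, n_blocks):
    """Stand-in for the device: emit n_blocks small lists of fake samples."""
    for i in range(n_blocks):
        out_q.put([i] * 4)
    out_q.put(SENTINEL)


def process(in_q, out_q):
    """Take raw blocks from one queue, reduce them, pass results on."""
    while True:
        block = in_q.get()
        if block is SENTINEL:
            out_q.put(SENTINEL)  # propagate shutdown to the next stage
            break
        out_q.put(sum(block))


def run_pipeline(n_blocks=5):
    raw_q = queue.Queue()      # acquisition -> processing
    result_q = queue.Queue()   # processing -> consumer (GUI, disk, ...)
    threads = [
        threading.Thread(target=acquire, args=(raw_q, n_blocks)),
        threading.Thread(target=process, args=(raw_q, result_q)),
    ]
    for t in threads:
        t.start()

    results = []
    while True:
        item = result_q.get()
        if item is SENTINEL:
            break
        results.append(item)

    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline())
```

Because each stage only knows about its input and output queue, you can benchmark or replace one stage without touching the others. If the processing is CPU-bound, the same layout should carry over to `multiprocessing.Queue` with processes instead of threads.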

Big Data

If your measurements are creating more data than your system can consume, then you should start by defining a few tiers (maybe just two) of how important your data is:

lossless -- if a point is missing, then you might as well start over
lossy -- if points or a set of data is missing, no big deal, just wait for the next update

One analogy might be a video stream...

lossless -- gold-masters for archival

lossy -- YouTube, Netflix, Hulu might drop a few frames, but your experience doesn't significantly suffer

From your description, the Acquisition and Processing must be lossless, while the GUI/visualization can be lossy.

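For lossless data, you should use queues, which keep every item until it is consumed. For lossy data, you can use deques with a fixed maximum length, which discard the oldest items once they are full.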
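
To illustrate the lossy case, here is a small sketch with `collections.deque` and its `maxlen` argument. The buffer size of 3 and the integer "frames" are placeholders; in practice the items would be the arrays you want to plot.

```python
from collections import deque

# Bounded buffer: once it holds 3 items, appending drops the oldest one.
display_buffer = deque(maxlen=3)

for frame in range(10):          # producer pushes 10 "frames"
    display_buffer.append(frame)

latest = list(display_buffer)    # consumer only sees the newest 3
print(latest)                    # prints [7, 8, 9]
```

The producer never blocks, so a slow GUI cannot stall the acquisition; it simply misses the frames that were overwritten.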

Design

Regardless of your data container, here are three different ways to connect your stages:

Producer-Consumer: P-C mimics a FIFO -- one actor generates data and another consumes it. You can build a chain of producers/consumers to accomplish your goal.
Observer: While P-C is typically one-to-one, the observer pattern can also be one-to-many. If you need multiple actors to react when one source changes, the observer pattern can give you that capability.
Mediator: Mediators are usually many-to-many. If each actor can cause the others to react, then all of them can coordinate through the mediator.
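
For comparison, a one-to-many observer can be sketched in a few lines. The `DataSource` class and its `subscribe`/`publish` methods are hypothetical names chosen for this example, not an existing API.

```python
class DataSource:
    """Hypothetical subject: notifies every subscriber of each new block."""

    def __init__(self):
        self._observers = []

    def subscribe(self, callback):
        self._observers.append(callback)

    def publish(self, block):
        for callback in self._observers:
            callback(block)


plotted = []   # stands in for a plotting widget
saved = []     # stands in for a disk writer

source = DataSource()
source.subscribe(plotted.append)
source.subscribe(lambda block: saved.append(len(block)))

source.publish([1, 2, 3])
print(plotted, saved)  # prints [[1, 2, 3]] [3]
```

Note that in this simple form the callbacks run in the publisher's thread, so a slow observer would delay the source.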

It seems like you just need a 1-1 relationship between each stage, so a producer-consumer design looks like it will suit your application.
