What is a good storage candidate for soft-realtime data acquisition under Linux?

I'm building a system for data acquisition. Acquired data typically consists of 15 signals, each sampled at (say) 500 Hz. That is, approximately 15 x 500 x 4 bytes (4-byte floats), roughly 30 kB, will arrive every second and have to be persisted.

The previous version was built on .NET (C#) using a DB4O db for data storage. This was fairly efficient and performed well.

The new version will be Linux-based, using Python (or maybe Erlang) and... well, that is the question: what is a suitable storage candidate?

I'm thinking of MongoDB, storing each sample (or actually a block of them) as a BSON object. Each sample block would have a sample counter as a key (indexed) field, as well as a signal-source identification.
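
For illustration, such a block-per-document layout with a compound index might look roughly like this in pymongo (the collection and field names are just placeholders I made up):

    # Rough sketch of a block-per-document MongoDB layout using pymongo.
    # Database, collection and field names are illustrative assumptions.
    from pymongo import ASCENDING, MongoClient

    client = MongoClient("mongodb://localhost:27017")
    samples = client["acquisition"]["sample_blocks"]

    # Compound index: look up blocks by signal source and sample-counter range.
    samples.create_index([("signal_source", ASCENDING), ("sample_counter", ASCENDING)])

    # One document per block of samples (e.g. 500 samples = 1 second per signal).
    samples.insert_one({
        "signal_source": "sensor-03",    # which of the 15 signals
        "sample_counter": 123500,        # counter of the first sample in the block
        "values": [0.0] * 500,           # the raw float samples for this block
    })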

The catch is that I have to be able to retrieve samples pretty quickly. When requested, up to 30 seconds of data have to be retrieved in much less than a second, using a sample counter range and requested signal sources. The current (C#/DB4O) version manages this OK, retrieving data in much less than 100 ms.
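 
The read side would then be a range query over the same made-up schema, sketched roughly like this; 30 seconds at 500 Hz is 15 000 samples per signal:

    # Fetch up to 30 seconds of data for selected signals by sample-counter range.
    # `samples` is the illustrative collection from the sketch above.
    start, end = 100000, 100000 + 30 * 500    # 30 s at 500 Hz = 15 000 samples

    cursor = samples.find(
        {
            "signal_source": {"$in": ["sensor-03", "sensor-07"]},
            "sample_counter": {"$gte": start, "$lt": end},
        },
        projection={"_id": 0, "signal_source": 1, "sample_counter": 1, "values": 1},
    ).sort("sample_counter", 1)

    blocks = list(cursor)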

I know that Python might not be ideal performance-wise, but we'll see about that later on.

The system ("server") will have multiple acquisition clients connected, so the architecture must scale well.

Edit: After further research I will probably go with HDF5 for sample data and either Couch or Mongo for more document-like information. I'll keep you posted.

Edit: The final solution was based on HDF5 and CouchDB. It performed just fine, implemented in Python, running on a Raspberry Pi.
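
For reference, a minimal sketch of how that split can be wired up in Python: h5py holds the bulk samples (see the h5py snippet under the HDF5 answer below), while a small per-session document goes into CouchDB over its plain HTTP API via requests. The database name, document id and fields here are made up:

    # Minimal sketch of the HDF5 + CouchDB split: bulk samples live in an HDF5
    # file, document-like information about each acquisition session in CouchDB.
    # Database name, document id and fields are illustrative assumptions.
    import requests

    COUCH = "http://localhost:5984"
    DB = "acquisition_meta"

    requests.put(f"{COUCH}/{DB}")    # create the database (412 if it already exists)
    requests.put(f"{COUCH}/{DB}/session-0042", json={
        "started": "2012-10-26T09:55:00Z",
        "signals": ["sensor-%02d" % i for i in range(15)],
        "sample_rate_hz": 500,
        "hdf5_file": "/data/acquisition.h5",    # where the raw samples live
    })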

Kornher answered 26/10, 2012 at 9:55 Comment(3)
Any specific reason you'd choose PostgreSQL? – Kornher
Postgres is scalable and designed for speed. Besides, your data seems fixed enough that MongoDB doesn't give you much advantage in that area, and speed-wise it loses to Postgres. – Gassman
Also, if you want to use Python, consider SQLAlchemy. – Gassman

You could have a look at HDF5 ... It is designed for streamed data, allows time-indexed seeking, and (as far as I know) is pretty well supported in Python.
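
Something along these lines with h5py, as a rough sketch (the file and dataset names are just examples): an extendable, chunked dataset that you append blocks to and later slice by sample index:

    # Sketch of streaming 15-signal blocks into an extendable HDF5 dataset with
    # h5py, then seeking back out by sample index. Names are illustrative.
    import h5py
    import numpy as np

    with h5py.File("acquisition.h5", "w") as f:
        # Unlimited first axis so the dataset can grow as samples stream in;
        # chunked so appends and range reads stay efficient.
        dset = f.create_dataset(
            "samples", shape=(0, 15), maxshape=(None, 15),
            dtype="float32", chunks=(500, 15))

        block = np.zeros((500, 15), dtype="float32")   # one second of acquired data
        dset.resize(dset.shape[0] + len(block), axis=0)
        dset[-len(block):] = block                     # append the new block

        # "Time-indexed" seeking: 30 s at 500 Hz is rows 0 .. 15000.
        window = dset[0:30 * 500, :]                   # reads only the needed chunks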

Viveca answered 30/10, 2012 at 18:52 Comment(2)
Thank you! That was really helpful! Checking it out right now. It seems to match my requirements perfectly. – Kornher
I'm going with HDF5. Thank you! – Kornher

Using the keys you described, you should be able to scale via sharding if necessary. 120 kB / 30 sec is not that much, so I think you do not need to shard too early.

Compared to just using files, you get more sophisticated queries and built-in replication for high availability, DS or offline processing (MapReduce etc.).
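
For example, the shard key could be declared via pymongo roughly like this (the database and collection names are only placeholders):

    # Sketch: shard the sample collection on (signal_source, sample_counter) so
    # range queries for one signal stay on a small number of shards.
    # Database and collection names are placeholders.
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")   # connect via mongos
    client.admin.command("enableSharding", "acquisition")
    client.admin.command(
        "shardCollection", "acquisition.sample_blocks",
        key={"signal_source": 1, "sample_counter": 1})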

Cacophony answered 30/10, 2012 at 16:33 Comment(0)

In your case, you could just create 15 files and append each sample sequentially to the corresponding file. This makes sure the requested samples are stored contiguously on disk, which reduces the number of disk seeks while reading.
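
A bare-bones sketch of that idea with numpy; the paths and helper functions are made up, not an existing API:

    # Sketch of the file-per-signal idea: raw float32 samples appended to one
    # flat binary file per signal, read back by seeking to a sample range.
    import numpy as np

    SAMPLE_SIZE = 4   # bytes per float32 sample

    def append_samples(signal_id, block):
        """Append a 1-D float32 block to that signal's file."""
        with open(f"signal_{signal_id:02d}.dat", "ab") as f:
            np.asarray(block, dtype="float32").tofile(f)

    def read_range(signal_id, first_sample, n_samples):
        """Read n_samples starting at first_sample: one seek, one sequential read."""
        with open(f"signal_{signal_id:02d}.dat", "rb") as f:
            f.seek(first_sample * SAMPLE_SIZE)
            return np.fromfile(f, dtype="float32", count=n_samples)

    # e.g. 30 s of signal 3 at 500 Hz:
    # data = read_range(3, first_sample=100000, n_samples=30 * 500)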

Abreu answered 27/10, 2012 at 17:02 Comment(0)
