Data persistency of scientific simulation data, Mongodb + HDF5?
I'm developing a Monte Carlo simulation software package that involves multiple physics models and simulators. I need to do online analysis, track the dependency of derived data on raw data, and perform queries like "give me the waveforms for temperature > 400 and position near (x0, y0)". So the in-memory data model is rather complicated.

The application is written in Python, with each simulation result modeled as a Python object. It produces ~100 results (objects) per hour. Most objects carry heavy data (several MB of binary numeric arrays) as well as some light data (temperature, position, etc.). The total data generation rate is several GB per hour.

I need a data persistence solution and an easy-to-use query API. I've already decided to store the heavy data (the numeric arrays) in HDF5 storage. I'm considering MongoDB for object persistence (light data only) and for indexing the heavy data in HDF5. Object persistence with MongoDB is straightforward, and the query interface looks sufficiently powerful.
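To make the plan concrete, here is a minimal sketch of the split I have in mind: the light data plus a reference into HDF5 go into a MongoDB document, while the array itself lives only in the HDF5 file. All field names (`hdf5_file`, `dataset_path`, etc.) are illustrative, not a fixed design.

```python
# Sketch: one MongoDB "index" document per simulation result.
# The heavy array stays in HDF5; Mongo holds light data + a pointer to it.

def make_index_doc(result_id, temperature, position, hdf5_file, dataset_path):
    """Build the light-data document that MongoDB would store and index."""
    return {
        "_id": result_id,
        "temperature": temperature,
        "position": position,          # e.g. [x, y]
        "hdf5_file": hdf5_file,        # path to the HDF5 container
        "dataset_path": dataset_path,  # dataset name inside that file
    }

doc = make_index_doc("run42/result007", 415.0, [1.2, 3.4],
                     "run42.h5", "/results/007/waveform")

# Against a live deployment this would become, roughly:
#   pymongo.MongoClient().simdb.results.insert_one(doc)
#   with h5py.File("run42.h5", "a") as f:
#       f.create_dataset("/results/007/waveform", data=waveform_array)
```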

I am aware of the SQLAlchemy + SQLite option. However, streaming the heavy data to HDF5 does not seem to be naturally supported in SQLAlchemy, and a fixed schema is cumbersome.

I am aware of this post (Searching a HDF5 dataset), but the "index table" itself needs some in-memory indices for fast queries.

I wonder: are there any alternative solutions I should look at before I jump in? Or is there any problem I've overlooked in my plan?

TIA.

Prudhoe answered 25/1, 2012 at 6:7
Some things to know about Mongo which might be relevant to the situation you described and why it might be a good fit:

I need to do online analysis, track the dependency of derived data on raw data, and perform queries like "give me the waveforms for temperature > 400 and position near (x0, y0)".

Mongo has a flexible query language that makes queries like this very easy to express. Geospatial (2d) indexes are also supported, and if you query on position and temperature very frequently, you can create a compound index covering both fields (for a compound index that includes a 2d key, the location field must come first), which will ensure that the query always performs well.
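As a sketch, here is what that query and those index specs would look like in PyMongo-style syntax; the collection layout and the `$maxDistance` value are assumptions for illustration, and the live `create_index`/`find` calls are shown only as comments.

```python
# Illustrative query and index specs for
# "waveforms with temperature > 400 and position near (x0, y0)".

x0, y0 = 1.0, 2.0

# Query document: range condition on temperature, geospatial $near on position.
query = {
    "temperature": {"$gt": 400},
    "position": {"$near": [x0, y0], "$maxDistance": 0.5},
}

# A 2d geospatial index is required for $near; when compounded with another
# field, the 2d (location) key is listed first.
compound_index = [("position", "2d"), ("temperature", 1)]

# Against a live collection this would be, roughly:
#   coll.create_index(compound_index)
#   for doc in coll.find(query):
#       ...load the referenced HDF5 dataset...
```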

Most objects have heavy data (several MB of binary numeric array), as well as some light data (temperature, position etc).

Each document in MongoDB can hold up to 16 MB of data, and a binary field type is also supported - so it would be relatively simple to embed a few megabytes of binary data in a field and retrieve it by querying other fields in the document. If you expect to need more than 16 MB, you can also use MongoDB's GridFS API, which allows you to store arbitrarily large blobs of binary data on disk and retrieve them quickly.
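A small sketch of that decision, using the stdlib `array` module as a stand-in for the numeric arrays (in practice this would be numpy) so the example is self-contained; the pymongo/GridFS calls are indicated only as comments.

```python
import array

MAX_DOC_BYTES = 16 * 1024 * 1024  # MongoDB's per-document limit

# Stand-in for "several MB of binary numeric array": 100,000 doubles.
waveform = array.array("d", (float(i) for i in range(100_000)))
payload = waveform.tobytes()  # 800,000 bytes of raw doubles

# Embed if it fits in one document, otherwise fall back to GridFS.
if len(payload) < MAX_DOC_BYTES:
    doc = {"temperature": 415.0, "waveform": payload}
    # pymongo stores bytes as a BSON binary field on insert:
    #   coll.insert_one(doc)
else:
    # GridFS path for oversized blobs:
    #   fs = gridfs.GridFS(db); file_id = fs.put(payload)
    pass

# Round-trip check: the bytes decode back to the same numbers.
restored = array.array("d")
restored.frombytes(payload)
```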

The total data generate rate is several GB per hour.

For a large, rapidly growing data set like this, you can create a sharded setup, which will allow you to add servers to accommodate the data no matter how large it grows.

Westbrooke answered 25/1, 2012 at 15:56
We need HDF5 for its nice features for storing numeric arrays, e.g. chunking, partial I/O, MPI support, lossy and lossless compression, etc. We intend to use HDF5 as the permanent storage, with a relatively fixed schema, while using MongoDB as the index while the application runs. The statistical algorithms in the application may change often, so we will rebuild the MongoDB index from the HDF5 storage each time the schema changes significantly. – Prudhoe
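The rebuild step described in this comment might look roughly like the sketch below. The walk over the HDF5 files (which would use h5py in practice) is stubbed with in-memory data, and the document layout is hypothetical; the point is only the shape of the rebuild loop.

```python
def rebuild_index(results, insert):
    """Rebuild the MongoDB index collection from permanent HDF5 storage.

    `results` yields (light_data, (hdf5_file, dataset_path)) pairs; in the
    real application this would walk the HDF5 files with h5py. `insert` is
    the sink for index documents (e.g. coll.insert_one in production).
    """
    n = 0
    for light, (fname, dset) in results:
        doc = dict(light)            # temperature, position, ...
        doc["hdf5_file"] = fname     # pointer back into permanent storage
        doc["dataset_path"] = dset
        insert(doc)
        n += 1
    return n

# Stubbed walk over a hypothetical HDF5 layout:
fake_store = [
    ({"temperature": 415.0, "position": [1.2, 3.4]},
     ("run42.h5", "/results/007/waveform")),
    ({"temperature": 390.0, "position": [0.1, 0.2]},
     ("run42.h5", "/results/008/waveform")),
]
index_docs = []
count = rebuild_index(fake_store, index_docs.append)
```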
Have you looked at VisTrails?

Faeroese answered 25/1, 2012 at 11:9
Thanks a lot for the pointer. I skimmed through the VisTrails documentation. The impression I have is that VisTrails is perfect for post-processing and graph-making. I guess my application is more specific to one problem and needs – Prudhoe
to hide many analysis details from the user. I'm reading its source code to see how data are stored in VisTrails; I hope I can learn something from it. VisTrails is an impressive piece of software; I wish I had had it as a graduate student. Thank you. – Prudhoe
