I'm developing a Monte Carlo simulation software package that involves multiple physics models and simulators. I need to do online analysis, track the dependency of derived data on raw data, and perform queries like "give me the waveforms for temperature > 400 and position near (x0, y0)". So the in-memory data model is rather complicated.
The application is written in Python, with each simulation result modeled as a Python object. It produces ~100 results (objects) per hour. Most objects have heavy data (binary numeric arrays of several MB) as well as some light data (temperature, position, etc.). The total data generation rate is several GB per hour.
I need a data persistency solution and an easy-to-use query API. I've already decided to store the heavy data (numeric arrays) in HDF5 storage(s). I'm considering MongoDB for object persistency (light data only) and for indexing the heavy data in HDF5. Object persistency with MongoDB is straightforward, and the query interface looks sufficiently powerful.
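To make the intended layout concrete: each MongoDB document would carry the light fields inline plus a pointer (file name + dataset path) into HDF5, so a query on the light data returns handles to the heavy arrays. The field names below are my own invention, and the actual pymongo/h5py calls (and the geospatial index a real `$near` query would need) are omitted so this sketch runs standalone; the filter function just mimics what the Mongo query would select:

```python
import math

# One document per simulation result: light data inline,
# heavy data referenced by (HDF5 file, dataset path).
doc = {
    "run_id": 42,
    "temperature": 415.3,             # light data
    "position": [1.2, 3.4],           # light data
    "hdf5_file": "results_2024.h5",   # pointer to heavy data
    "hdf5_dataset": "/waveforms/run42",
}

def matches(doc, t_min, x0, y0, radius):
    """Mimic the intended query, roughly
    {"temperature": {"$gt": t_min}, "position": near (x0, y0)}."""
    dx = doc["position"][0] - x0
    dy = doc["position"][1] - y0
    return doc["temperature"] > t_min and math.hypot(dx, dy) <= radius

print(matches(doc, 400, 1.0, 3.0, 1.0))  # True for this document
```

The matching documents would then be resolved to their HDF5 datasets for the heavy-data part of the analysis.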
I am aware of the sqlalchemy+sqlite option. However, streaming the heavy data to HDF5 does not seem to be naturally supported in SQLAlchemy, and a fixed schema is cumbersome.
I am aware of this post (Searching a HDF5 dataset), but the "index table" itself needs some in-memory indices for fast queries.
Are there any alternative solutions I should look at before I jump in? Or are there any problems I've overlooked in my plan?
TIA.