"Logbook" for scientific simulations

Asked 3/11, 2011 at 10:7 Answered 8/1, 2012 at 1:26

I'm using C++ to run scientific simulations. As the number of parameters has grown, I have found it necessary to keep a "logbook": a file that stores all the information about a given simulation (not the output, but the parameters that led to that output and the corresponding git commit).

I've searched, and XML seems like a good option, since it can easily be parsed with Python, Mathematica or other analysis software.

I wonder if anyone agrees with this, or has a better option.

I also wonder how I can retrieve the current git commit so that I can save it in the logbook.

Irenairene answered 3/11, 2011 at 10:7 Comment(1)

There are much nicer and more flexible markup languages that I think would be a better fit for what you describe, i.e. YAML/JSON. They are human readable, easily modified by hand, and easy to load: in Python, for example, you just load the file and have all the data in objects that can be easily manipulated. I also know of people using databases for this sort of thing: with SQLite and a nice library for your environment (Python or R or whatever), you should be able to easily whip up some scripts to extract the information you want. – Gesticulative 3/11, 2011 at 14:13

In general I agree with you:

XML is widely deployed, and there are tonnes of tools to bring the logbook into shape.
It's flexible: you can add attributes later without breaking old "scripts".
It's file based: one document, one file, and you can use the filesystem to organise logbook "pages".
It's file based and plain text, so tools like find, grep and diff (at a push) can help you in urgent cases.
It's your own solution: you're free to track any information you need, and if you deem it essential to associate sunlight hours with the parameters, do it.

That being said, I should add that the right storage format depends on the typical use case. If you need to find out why the optimiser cannot find any solutions every Monday after a full moon, it will be hard (well, harder) to come up with the necessary XPath/XQuery hackery, because your ad hoc structure follows no fixed schema.

The downsides I can think of:

It's verbose: XML documents in my area tend to be around 20 to 40 GB, whereas the information could probably be represented in about 500 MB.
It's slow (depending on how you use it): RDBMSs and even NoSQL solutions employ techniques like indexing to make reading faster.
It's flexible, which is also a downside: if you happen to add two new attributes per day, you will end up with nothing but marked-up free text, which will need thorough polishing before you can import it into structure-focussed systems (SQL, CSV, JSON, ...).
It's your own solution: you have to write it and maintain it.
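
If XML is the choice despite these trade-offs, one logbook "page" could be written along the following lines. This is only a minimal sketch: the element and attribute names (simulation, commit, parameter, name) and both function names are made up for illustration, and a real XML library would be a reasonable replacement for the hand-written escaping.

```cpp
#include <map>
#include <sstream>
#include <string>

// Escape the five characters that XML reserves in text and attribute values.
std::string xml_escape(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;
        }
    }
    return out;
}

// Build one logbook "page" as a string.
// The schema (simulation/parameter) is hypothetical, not a standard.
std::string xml_logbook_page(const std::string& commit,
                             const std::map<std::string, std::string>& params) {
    std::ostringstream os;
    os << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    os << "<simulation commit=\"" << xml_escape(commit) << "\">\n";
    for (const auto& kv : params) {
        os << "  <parameter name=\"" << xml_escape(kv.first) << "\">"
           << xml_escape(kv.second) << "</parameter>\n";
    }
    os << "</simulation>\n";
    return os.str();
}
```

Writing the returned string to one file per run keeps the "one document, one file" property mentioned above.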

As for the second bit: git describe --always HEAD
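
To capture that identifier from inside the C++ program, one possibility is to run the command at start-up and read its output. The sketch below assumes a POSIX system (popen/pclose), that git is on the PATH, and that the program is started from inside the repository; run_command and current_git_commit are hypothetical helper names.

```cpp
#include <array>
#include <cstdio>
#include <memory>
#include <stdexcept>
#include <string>

// Run a shell command and return its standard output,
// with trailing newline characters removed. POSIX only.
std::string run_command(const std::string& cmd) {
    std::unique_ptr<FILE, int (*)(FILE*)> pipe(popen(cmd.c_str(), "r"), pclose);
    if (!pipe) {
        throw std::runtime_error("popen failed for: " + cmd);
    }
    std::string out;
    std::array<char, 256> buf;
    while (fgets(buf.data(), static_cast<int>(buf.size()), pipe.get()) != nullptr) {
        out += buf.data();
    }
    while (!out.empty() && (out.back() == '\n' || out.back() == '\r')) {
        out.pop_back();
    }
    return out;
}

// Hypothetical helper: identifier of the commit checked out
// in the current working directory.
std::string current_git_commit() {
    return run_command("git describe --always HEAD");
}
```

Note that this records the commit of the working directory at run time, which is not necessarily the commit the binary was built from; passing the identifier to the compiler as a preprocessor definition at build time is an alternative.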

Huckaback answered 3/11, 2011 at 10:44 Comment(0)

The easiest option is to make your program a pure function, i.e. externalize all changing and possibly changing parameters into program options so that a simulation is completely specified by the options and a git commit identifier.

Boost.Program_options aids greatly in implementing such a scheme.
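
As a rough illustration of the idea (deliberately not using the Boost.Program_options API, so that the sketch stays free of dependencies), the options can be collected into a map and written out next to the commit identifier. The --name=value convention and the names Params, parse_options and logbook_entry are assumptions made for this example.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Params = std::map<std::string, std::string>;

// Collect arguments of the form --name=value; anything else is ignored.
Params parse_options(const std::vector<std::string>& args) {
    Params params;
    for (const auto& arg : args) {
        if (arg.rfind("--", 0) != 0) continue;        // does not start with "--"
        const auto eq = arg.find('=');
        if (eq == std::string::npos) continue;        // no value given
        params[arg.substr(2, eq - 2)] = arg.substr(eq + 1);
    }
    return params;
}

// One logbook entry: the commit plus every option, one "key=value" per line.
// std::map iterates in key order, so the output is deterministic.
std::string logbook_entry(const std::string& commit, const Params& params) {
    std::ostringstream os;
    os << "commit=" << commit << '\n';
    for (const auto& kv : params) {
        os << kv.first << '=' << kv.second << '\n';
    }
    return os.str();
}
```

In main, the arguments would be passed as std::vector<std::string>(argv + 1, argv + argc). Because the entry contains every option and the commit, rerunning the same commit with the same options should reproduce the simulation, provided the program has no other hidden inputs.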

Histrionic answered 3/11, 2011 at 10:36 Comment(0)

This may sound odd on a programming site, but I found, while doing several bits of simulation work, that the best log book was... well... a log book.

Specifically, I've used this one extensively (link to Amazon). It may be because I came from a wet lab/biology background, but I found something appealing about an old dead-tree notebook. It's admittedly not automated, and it won't do well if you're running a huge number of different parameter combinations or if your simulation has a large number of parameters to begin with.

But for the project I was working on, which had about 20 parameters that might vary, I liked being able to record freeform notes about my thoughts and have them in an easily portable, easy to recall and fairly durable form. And for many fellow lab mates, "Keep a lab notebook" seemed to work better with a physical thing.

Your mileage may, of course, vary.

Gervase answered 8/1, 2012 at 1:26 Comment(1)

Yes, I have used one too, and most of the time a 'real' logbook is fundamental. The point is that I save all the data from my numerical experiments, so a good 'virtual' platform to associate the parameters with the results is important too. Nevertheless, one DOES NOT replace the other... – Irenairene 9/1, 2012 at 13:43

You could also tag the particular commit. See http://book.git-scm.com/3_git_tag.html for details.

Birkett answered 3/11, 2011 at 12:22 Comment(0)

Use comma-separated or tab-delimited values. They are human-readable and editable, have little storage overhead, and are easily importable into just about anything (including R and Excel).
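
A minimal sketch of what writing such a logbook row from C++ might look like is given below. It quotes fields the way RFC 4180 describes (fields containing a comma, a double quote or a line break are wrapped in double quotes, and embedded quotes are doubled); csv_field and csv_row are hypothetical helper names, and the column layout is up to you.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Quote a single field if it contains a comma, a double quote or a line break.
std::string csv_field(const std::string& s) {
    if (s.find_first_of(",\"\n\r") == std::string::npos) {
        return s;                       // nothing special, emit as-is
    }
    std::string out = "\"";
    for (char c : s) {
        if (c == '"') out += '"';       // double any embedded quote
        out += c;
    }
    out += '"';
    return out;
}

// Join the fields into one comma-separated line ending in a newline.
std::string csv_row(const std::vector<std::string>& fields) {
    std::string row;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i > 0) row += ',';
        row += csv_field(fields[i]);
    }
    row += '\n';
    return row;
}
```

Appending one row per run (commit identifier first, then the parameter values in a fixed column order) to a file opened with std::ios::app gives a logbook that grows by one line per simulation.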

Daman answered 3/11, 2011 at 15:8 Comment(0)

The particle physics world mostly uses ROOT for its data collection, storage and analysis needs. This includes data from simulation. ROOT makes it possible, indeed easy, to keep a full set of metadata with the results.

Generally when we have large data sets we also keep a database, but that is to make it convenient to construct queries: the real record keeping is in the included metadata.

Byte answered 4/12, 2011 at 2:7 Comment(0)
