I run into the following problem when writing scientific code with Python:
- Usually you write the code iteratively, as a script which perform some computation.
- Finally, it works; now you wish to run it with multiple inputs and parameters and find it takes too much time.
- Recalling you work for a fine academic institute and have access to a ~100 CPUs machines, you are puzzled how to harvest this power. You start by preparing small shell scripts which run the original code with different inputs and run them manually.
Being an engineer, I know all about the right architecture for this (with work items queued, and worker threads or processes, and work results queued and written to persistent store); but I don't want to implement this myself. The most problematic issue is the need for reruns due to code changes or temporary system issues (e.g. out-of-memory).
I would like to find some framework to which I will provide the wanted inputs (e.g. with a file with one line per run) and then I will be able to just initiate multiple instances of some framework-provided agent which will run my code. If something went bad with the run (e.g. temporary system issue or thrown exception due to bug) I will be able to delete results and run some more agents. If I take too many resources, I will be able to kill some agents without a fear of data-inconsistency, and other agents will pick-up the work-items when they find the time.
Any existing solution? Anyone wishes to share his code which do just that? Thanks!
MapReduce
algorithm to run the scripts in a divide and conquer sort of method. So I would definetly give that a try, good link here, michael-noll.com/tutorials/… – Skindive