Harvesting the power of highly-parallel computers with python scientific code [closed]

Asked 14/8, 2013 at 14:28 Answered 22/8, 2013 at 13:12

Solved python concurrency parallel-processing multiprocessing scientific-computing

I run into the following problem when writing scientific code with Python:

Usually you write the code iteratively, as a script which perform some computation.
Finally, it works; now you wish to run it with multiple inputs and parameters and find it takes too much time.
Recalling you work for a fine academic institute and have access to a ~100 CPUs machines, you are puzzled how to harvest this power. You start by preparing small shell scripts which run the original code with different inputs and run them manually.

Being an engineer, I know all about the right architecture for this (with work items queued, and worker threads or processes, and work results queued and written to persistent store); but I don't want to implement this myself. The most problematic issue is the need for reruns due to code changes or temporary system issues (e.g. out-of-memory).

I would like to find some framework to which I will provide the wanted inputs (e.g. with a file with one line per run) and then I will be able to just initiate multiple instances of some framework-provided agent which will run my code. If something went bad with the run (e.g. temporary system issue or thrown exception due to bug) I will be able to delete results and run some more agents. If I take too many resources, I will be able to kill some agents without a fear of data-inconsistency, and other agents will pick-up the work-items when they find the time.

Any existing solution? Anyone wishes to share his code which do just that? Thanks!

Abundant answered 14/8, 2013 at 14:28 Comment(7)

I don't use it myself, but python for mpi is commonly used for this sort of thing. Since mpi is already huge in the scientific computing space, it may fit into your existing architecture. – Brynn 14/8, 2013 at 15:47

Being an engineer, I know all ... but I don't want to implement this myself. reads more like a manager to me. Does your machine have a job management system installed ? – Barrow 14/8, 2013 at 16:41

@High Performance Mark: Ouch.. I prefer to call this being focused. :) But really, scientific code is not like production code; if writing some experimental code takes half a day it's not ok it will take another half a day to make it run on a cluster, which is the case for me today. – Abundant 14/8, 2013 at 21:9

@tdelaney: mpi is a low-level API which has to do with scheduling and communication; I need something build on top of it which will just run my code with arbitrary parallelism. – Abundant 14/8, 2013 at 21:12

I have run into this type of problem myself. What I did was I used the Apache Hadoop framework with the google MapReduce algorithm to run the scripts in a divide and conquer sort of method. So I would definetly give that a try, good link here, michael-noll.com/tutorials/… – Skindive 16/8, 2013 at 18:20

@UriCohen: you were so focused on the first part of my earlier comment that you didn't answer the question I asked, which was not entirely random. Does your machine have a job management system installed ? – Barrow 18/8, 2013 at 17:12

@High Performance Mark: Yes, we have qsub system which manage machine allocation and queues jobs; but it doesn't help me AFAIK in implementing work-item allocation between parallel jobs. – Abundant 18/8, 2013 at 19:22

First of all, I would like to stress that the problem that Uri described in his question is indeed faced by many people doing scientific computing. It may be not easy to see if you work with a developed code base that has a well defined scope - things do not change as fast as in scientific computing or data analysis. This page has an excellent description why one would like to have a simple solution for parallelizing pieces of code.

So, this project is a very interesting attempt to solve the problem. I have not tried using it myself yet, but it looks very promising!

Implacable answered 19/8, 2013 at 2:26 Comment(1)

I agree. Great answer. – Abundant 22/8, 2013 at 19:43

I might be wrong, but simply using GNU command line utilities, like parallel, or even xargs, seems appropriate to me for this case. Usage might look like this:

cat inputs | parallel ./job.py --pipe > results 2> exceptions

This will execute job.py for every line of inputs in parallel, output successful results into results, and failed ones to exceptions. A lot of examples of usage (also for scientific Python scripts) can be found in this Biostars thread.

And, for completeness, Parallel documentation.

Parotic answered 20/8, 2013 at 1:4 Comment(1)

That's a nice and simple solution for any machine without a management layer. Thanks! – Abundant 22/8, 2013 at 19:22

If you with "have access to a ~100 CPUs machines" mean that you have access to 100 machines each having multiple CPUs and in case you want a system that is generic enough for different kinds of applications, then the best possible (and in my opinion only) solution is to have a management layer between your resources and your job input. This is nothing Python-specific at all, it is applicable in a much more general sense. Such a layer manages the computing resources, assigns tasks to single resource units and monitors the entire system for you. You make use of the resources via a well-defined interface as provided by the management system. Such as management system is usually called "batch system" or "job queueing system". Popular examples are SLURM, Sun Grid Engine, Torque, .... Setting each of them up is not trivial at all, but also your request is not trivial.

Python-based "parallel" solutions usually stop at the single-machine level via multiprocessing. Performing parallelism beyond a single machine in an automatic fashion requires a well-configured resource cluster. It usually involves higher level mechanisms such as the message passing interface (MPI), which relies on a properly configured resource system. The corresponding configuration is done on the operating system and even hardware level on every single machine involved in the resource pool. Eventually, a parallel computing environment involving many single machines of homogeneous or heterogeneous nature requires setting up such a "batch system" in order to be used in a general fashion.

You realize that you don't get around the effort in properly implementing such a resource pool. But what you can do is totally isolate this effort form your application layer. You once implement such a managed resource pool in a generic fashion, ready to be used by any application from a common interface. This interface is usually implemented at the command line level by providing job submission, monitoring, deletion, ... commands. It is up to you to define what a job is and which resources it should consume. It is up to the job queueing system to assign your job to specific machines and it is up to the (properly configured) operating system and MPI library to make sure that the communication between machines is working.

In case you need to use hardware distributed among multiple machines for one single application and assuming that the machines can talk to each other via TCP/IP, there are Python-based solutions implementing so to say less general job queueing systems. You might want to have a look at http://python-rq.org/ or http://itybits.com/pyres/intro.html (there are many other comparable systems out there, all based on an independent messaging / queueing instance such as Redis or ZeroMQ).

Holmgren answered 17/8, 2013 at 18:20 Comment(0)

So, this project is a very interesting attempt to solve the problem. I have not tried using it myself yet, but it looks very promising!

Implacable answered 19/8, 2013 at 2:26 Comment(1)

I agree. Great answer. – Abundant 22/8, 2013 at 19:43

Usually you write the code iteratively, as a script which perform some computation.

This makes me think you'd really like ipython notebooks

A notebook is a file that has a structure which is a mix between a document and an interactive python interpreter. As you edit the python parts of the document they can be executed and the output embedded in the document. It's really good programming where you're exploring the problem space, and want to make notes as you go.

It's also heavily integrated with matplotlib, so you can display graphs inline. You can embed Latex math inline, and many media objects types such as pictures and video.

Here's a basic example, and a flashier one

Finally, it works; now you wish to run it with multiple inputs and parameters and find it takes too much time.

Recalling you work for a fine academic institute and have access to a ~100 CPUs machines, you are puzzled how to harvest this power. You start by preparing small shell scripts which run the original code with different inputs and run them manually.

This makes me think you'd really like ipython clusters

iPython clusters allow you to run parallel programs across multiple machines. Programs can either be SIMD (which sound like your case) or MIMD style. Programs can be edited and debugged interactively.

There were several talks about iPython at the recent SciPy event. Going onto PyVideo.org and searching gives numerous videos, including:

I not watched all of these, but they're probably a good starting point.

Probation answered 22/8, 2013 at 13:12 Comment(0)

Recommended topics

Hot tags