Parallel application in python becomes much slower when using mpi rather than multiprocessing module
Lately I've observed a weird effect while measuring the performance of my parallel application, using the multiprocessing module and mpi4py as communication tools.

The application performs evolutionary algorithms on sets of data. Most operations are done sequentially, with the exception of the evaluation. After all evolutionary operators are applied, all individuals need to receive new fitness values, which is done during the evaluation. Basically it's just a mathematical calculation performed on a list of floats (plain Python floats). Before the evaluation, a data set is scattered either by MPI's scatter or Python's Pool.map; then comes the parallel evaluation, and afterwards the data comes back through MPI's gather or, again, the Pool.map mechanism.
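
For concreteness, the evaluation boils down to something like the sketch below. The real fitness function isn't shown here (it comes from DEAP, mentioned further down), so eval_one is just a hypothetical stand-in; note that it returns a tuple, because DEAP stores fitness values as tuples.

def eval_one(individual):
    # purely numerical work on a plain list of Python floats
    return (sum(x * x for x in individual),)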

My benchmark platform is a virtual machine (virtualbox) running Ubuntu 11.10 with Open MPI 1.4.3 on Core i7 (4/8 cores), 8 GB of RAM and an SSD drive.

What I find truly surprising is that I do get a nice speed-up, but depending on the communication tool, the performance gets worse again after a certain number of processes. This is illustrated by the pictures below.

y axis - processing time
x axis - number of processes
colours - size of each individual (number of floats)

1) Using the multiprocessing module - Pool.map: [plot]

2) Using MPI - Scatter/Gather: [plot]

3) Both plots on top of each other: [plot]

At first I was thinking that it's hyperthreading's fault, because for large data sets it becomes slower after reaching 4 processes (4 physical cores). However, that should also be visible in the multiprocessing case, and it isn't. My other guess is that the MPI communication methods are much less effective than the Python ones, but I find that hard to believe.

Does anyone have any explanation for these results?

ADDED:

I'm starting to believe that it's hyperthreading's fault after all. I tested my code on a machine with a Core i5 (2/4 cores) and the performance is worse with 3 or more processes. The only explanation that comes to my mind is that the i7 I'm using doesn't have enough resources (cache?) to compute the evaluation concurrently with hyperthreading, so it needs to schedule more than 4 processes to run on the 4 physical cores.

What's interesting, however, is that when I use MPI, htop shows complete utilization of all 8 logical cores, which would suggest that the above statement is incorrect. On the other hand, when I use Pool.map it doesn't completely utilize all cores: it uses one or two to the maximum and the rest only partially. Again, I have no idea why it behaves this way. Tomorrow I will attach a screenshot showing this behaviour.

I'm not doing anything fancy in the code; it's really straightforward. (I'm not posting the entire code, not because it's secret, but because it needs additional libraries like DEAP to be installed. If someone is really interested in the problem and ready to install DEAP, I can prepare a short example.) The MPI code is a little bit different, because it can't deal with the population container (which inherits from list). There is some overhead of course, but nothing major. Apart from the code I show below, the rest is the same.

Pool.map:

from multiprocessing import Pool

def eval_population(func, pop):
    # assign a new fitness value to every individual in the (sub)population
    for ind in pop:
        ind.fitness.values = func(ind)

    return pop

# ...
self.pool = Pool(8)
# ...

for iter_ in xrange(nr_of_generations):
    # ...
    # evaluate is really an eval_population alias with a certain function
    # assigned to its first argument
    self.pool.map(evaluate, pop)
    # ...
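
The evaluate alias above could look, for example, like this (a sketch only - the actual binding isn't shown above, and my_fitness_func is a placeholder name, not something from the real code):

def evaluate(pop):
    # eval_population with the fitness function already bound,
    # defined at module level so Pool.map can pickle it
    return eval_population(my_fitness_func, pop)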

MPI - Scatter/Gather:

from itertools import chain
from mpi4py import MPI

def divide_list(lst, n):
    # split lst into n parts (round-robin)
    return [lst[i::n] for i in xrange(n)]

def chain_list(lst):
    # flatten a list of lists back into a single list
    return list(chain.from_iterable(lst))

def evaluate_individuals_in_groups(func, rank, individuals):
    comm = MPI.COMM_WORLD
    size = MPI.COMM_WORLD.Get_size()

    packages = None
    if not rank:
        packages = divide_list(individuals, size)

    ind_for_eval = comm.scatter(packages)
    eval_population(func, ind_for_eval)

    pop_with_fit = comm.gather(ind_for_eval)

    if not rank:
        pop_with_fit = chain_list(pop_with_fit)
        for index, elem in enumerate(pop_with_fit):
            individuals[index] = elem

for iter_ in xrange(nr_of_generations):
    # ...
    evaluate_individuals_in_groups(self.func, self.rank, pop)
    # ...
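
For reference, the rank passed in above comes from the communicator, roughly like this (a sketch of the surrounding setup, which isn't shown here):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # rank 0 is the root that scatters and gathers
# every rank has to call evaluate_individuals_in_groups in each generation,
# otherwise the collective scatter/gather calls would block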

ADDED 2: As I mentioned earlier, I made some tests on my i5 machine (2/4 cores) and here is the result: [plot]

I also found a machine with two Xeons (2x 6/12 cores) and repeated the benchmark: [plot]

Now I have 3 examples of the same behaviour. When I run my computation with more processes than physical cores, it starts getting worse. I believe it's because the processes on the same physical core can't be executed concurrently due to the lack of resources.
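
A quick way to double-check the logical vs. physical core counts on each test machine is something like this (a sketch that assumes the psutil package is installed; it is not part of the benchmark itself):

import multiprocessing
import psutil

print 'logical cores: ', multiprocessing.cpu_count()       # counts hyperthreads
print 'physical cores:', psutil.cpu_count(logical=False)   # physical cores only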

Spaulding answered 11/6, 2013 at 21:52 Comment(6)
MPI implementations typically have many different algorithms for collective operations like scatter and gather. Most libraries have their own heuristics to select the best algorithm, but sometimes they fail. Some libraries allow these algorithms to be forced, e.g. in Open MPI this can be achieved by passing MCA arguments to mpiexec. It would help if you told us which MPI implementation you use. – Holler
I use Open MPI, added to the question. – Spaulding
It would be great if you could provide more context, for example show the parts of the code where the data is being distributed (both implementations). Usually scatter and gather in Open MPI scale very well with the number of processes when they are all running on a single node. – Holler
I could take a deeper look into your issue, but only after ISC'13 is over at the end of next week. Hope you'll find the culprit by then. – Holler
Did you ever find out the reason? This is exactly the result I'm getting on a 2-processor, 6-core Xeon as well. – Genetics
@HristoIliev Regarding your first comment, why would this affect the performance of a process loading a file? Loading a file with enough cores that hyperthreading is being used doubles the read time of that file, compared to multiprocessing, which stays consistent. – Genetics

MPI is actually designed for inter-node communication, i.e. talking to other machines over the network. Using MPI on the same node can result in a big overhead for every message that has to be sent, compared to e.g. threading.

mpi4py makes a copy of every message, since it's targeted at distributed-memory usage. If your Open MPI is not configured to use shared memory for intra-node communication, the message will be sent through the kernel's TCP stack and back to get delivered to the other process, which again adds some overhead.
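
If those per-message copies are a significant part of the cost, one thing worth trying is mpi4py's uppercase, buffer-based collectives (Scatter/Gather) on contiguous numpy arrays instead of the lowercase, pickle-based ones used on Python lists. The following is only a sketch, under the assumption that the individuals can be packed into a flat float array, which is not how the code in the question works:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

per_rank, ind_len = 4, 100          # assumed sizes, for illustration only

sendbuf = None
if rank == 0:
    # whole population packed into one contiguous float64 array
    sendbuf = np.random.rand(size * per_rank, ind_len)

recvbuf = np.empty((per_rank, ind_len), dtype='d')
comm.Scatter(sendbuf, recvbuf, root=0)      # raw buffers, no pickling

# ... evaluate the rows of recvbuf here ...

gathered = np.empty((size * per_rank, ind_len), dtype='d') if rank == 0 else None
comm.Gather(recvbuf, gathered, root=0)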

If you only intend to do computations within the same machine, there is no need to use MPI here.

Some of this is discussed in this thread.

Update: The ipc-benchmark project tries to make some sense out of how different communication types perform on different systems (multicore, multiprocessor, shared memory), and especially how this behaves on virtualized machines.

I recommend running the ipc-benchmark on the virtualized machine and posting the results. If they look anything like this benchmark, it can give you a big insight into the difference between TCP, sockets and pipes.

Candlelight answered 21/6, 2013 at 13:32 Comment(3)
I'm working on an application for testing evolutionary algorithms in different computing environments: PCs, clusters, grids, etc. I'm not arguing whether or not it's justified to use MPI here; I'm just curious where the overhead comes from. By the way, Python's Pool.map uses a queue for communication, which is implemented either with pipes or with sockets. I configured it to use sockets, so the communication method is the same. – Spaulding
@Spaulding Just because you're using sockets doesn't mean you're using TCP. Here are some interesting slides on the different speeds you might get using pipes, sockets or shmem, plus a benchmarking tool that will tell you what might be the fastest on your system (depending on how your cores can communicate, how your CPU sockets can communicate, and their NUMA domain): anil.recoil.org/talks/fosdem-io-2012.pdf – Candlelight
Thanks for pointing that out. I will check what you wrote and the website you gave when I have some time. – Spaulding
