Parallel application in python becomes much slower when using mpi rather than multiprocessing module
Lately I've observed a weird effect while measuring the performance of my parallel application, using the multiprocessing module and mpi4py as communication tools.

The application performs evolutionary algorithms on sets of data. Most operations are done sequentially, with the exception of the evaluation. After all evolutionary operators are applied, all individuals need to receive new fitness values, which is done during the evaluation. Basically it's just a mathematical calculation performed on a list of floats (plain Python floats). Before the evaluation, a data set is scattered either by MPI's scatter or Python's Pool.map; then comes the parallel evaluation, and afterwards the data comes back through MPI's gather or, again, the Pool.map mechanism.
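
For concreteness, the evaluation boils down to something like the sketch below. The real fitness function isn't shown here (it comes from DEAP, mentioned further down), so eval_one is just a hypothetical stand-in; note that it returns a tuple, because DEAP stores fitness values as tuples.

def eval_one(individual):
    # purely numerical work on a plain list of Python floats
    return (sum(x * x for x in individual),)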

My benchmark platform is a virtual machine (virtualbox) running Ubuntu 11.10 with Open MPI 1.4.3 on Core i7 (4/8 cores), 8 GB of RAM and an SSD drive.

What I find truly surprising is that I do get a nice speed-up, but depending on the communication tool, the performance gets worse again after a certain number of processes. This is illustrated by the pictures below.

y axis - processing time
x axis - number of processes
colours - size of each individual (number of floats)

1) Using the multiprocessing module - Pool.map: [plot]

2) Using MPI - Scatter/Gather: [plot]

3) Both plots on top of each other: [plot]

At first I was thinking that it's hyperthreading's fault, because for large data sets it becomes slower after reaching 4 processes (4 physical cores). However, that should also be visible in the multiprocessing case, and it isn't. My other guess is that the MPI communication methods are much less effective than the Python ones, but I find that hard to believe.

Does anyone have any explanation for these results?

ADDED:

I'm starting to believe that it's hyperthreading's fault after all. I tested my code on a machine with a Core i5 (2/4 cores) and the performance is worse with 3 or more processes. The only explanation that comes to my mind is that the i7 I'm using doesn't have enough resources (cache?) to compute the evaluation concurrently with hyperthreading, so it needs to schedule more than 4 processes to run on the 4 physical cores.

What's interesting, however, is that when I use MPI, htop shows complete utilization of all 8 logical cores, which would suggest that the above statement is incorrect. On the other hand, when I use Pool.map it doesn't completely utilize all cores: it uses one or two to the maximum and the rest only partially. Again, I have no idea why it behaves this way. Tomorrow I will attach a screenshot showing this behaviour.

I'm not doing anything fancy in the code; it's really straightforward. (I'm not posting the entire code, not because it's secret, but because it needs additional libraries like DEAP to be installed. If someone is really interested in the problem and ready to install DEAP, I can prepare a short example.) The MPI code is a little bit different, because it can't deal with the population container (which inherits from list). There is some overhead of course, but nothing major. Apart from the code I show below, the rest is the same.

Pool.map:

from multiprocessing import Pool

def eval_population(func, pop):
    # assign a new fitness value to every individual in the (sub)population
    for ind in pop:
        ind.fitness.values = func(ind)

    return pop

# ...
self.pool = Pool(8)
# ...

for iter_ in xrange(nr_of_generations):
    # ...
    # evaluate is really an eval_population alias with a certain function
    # assigned to its first argument
    self.pool.map(evaluate, pop)
    # ...
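
The evaluate alias above could look, for example, like this (a sketch only - the actual binding isn't shown above, and my_fitness_func is a placeholder name, not something from the real code):

def evaluate(pop):
    # eval_population with the fitness function already bound,
    # defined at module level so Pool.map can pickle it
    return eval_population(my_fitness_func, pop)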

MPI - Scatter/Gather:

from itertools import chain
from mpi4py import MPI

def divide_list(lst, n):
    # split lst into n parts (round-robin)
    return [lst[i::n] for i in xrange(n)]

def chain_list(lst):
    # flatten a list of lists back into a single list
    return list(chain.from_iterable(lst))

def evaluate_individuals_in_groups(func, rank, individuals):
    comm = MPI.COMM_WORLD
    size = MPI.COMM_WORLD.Get_size()

    packages = None
    if not rank:
        packages = divide_list(individuals, size)

    ind_for_eval = comm.scatter(packages)
    eval_population(func, ind_for_eval)

    pop_with_fit = comm.gather(ind_for_eval)

    if not rank:
        pop_with_fit = chain_list(pop_with_fit)
        for index, elem in enumerate(pop_with_fit):
            individuals[index] = elem

for iter_ in xrange(nr_of_generations):
    # ...
    evaluate_individuals_in_groups(self.func, self.rank, pop)
    # ...
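
For reference, the rank passed in above comes from the communicator, roughly like this (a sketch of the surrounding setup, which isn't shown here):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # rank 0 is the root that scatters and gathers
# every rank has to call evaluate_individuals_in_groups in each generation,
# otherwise the collective scatter/gather calls would block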

ADDED 2: As I mentioned earlier, I made some tests on my i5 machine (2/4 cores) and here is the result: [plot]

I also found a machine with two Xeons (2x 6/12 cores) and repeated the benchmark: [plot]

Now I have 3 examples of the same behaviour. When I run my computation with more processes than physical cores, it starts getting worse. I believe it's because the processes on the same physical core can't be executed concurrently due to the lack of resources.
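
A quick way to double-check the logical vs. physical core counts on each test machine is something like this (a sketch that assumes the psutil package is installed; it is not part of the benchmark itself):

import multiprocessing
import psutil

print 'logical cores: ', multiprocessing.cpu_count()       # counts hyperthreads
print 'physical cores:', psutil.cpu_count(logical=False)   # physical cores only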

Spaulding answered 11/6, 2013 at 21:52 Comment(6)
MPI implementations typically have many different algorithms for collective operations like scatter and gather. Most libraries have their own heuristics to select the best algorithm, but sometimes they fail. Some libraries allow these algorithms to be forced, e.g. in Open MPI this can be achieved by passing MCA arguments to mpiexec. It would help if you told us which MPI implementation you use. – Holler
I use Open MPI, added to the question. – Spaulding
It would be great if you could provide more context, for example show the parts of the code where the data is being distributed (both implementations). Usually scatter and gather in Open MPI scale very well with the number of processes when they are all running on a single node. – Holler
I could take a deeper look into your issue, but only after ISC'13 is over at the end of next week. Hope you'll find the culprit by then. – Holler
Did you ever find out the reason? This is exactly the result I'm getting on a 2-processor, 6-core Xeon as well. – Genetics
@HristoIliev Regarding your first comment, why would this affect the performance of a process loading a file? Loading a file with enough cores that hyperthreading is being used doubles the read time of that file, compared to multiprocessing, which stays consistent. – Genetics

MPI is actually designed for inter-node communication, i.e. talking to other machines over the network. Using MPI on the same node can result in a big overhead for every message that has to be sent, compared to e.g. threading.

mpi4py makes a copy of every message, since it's targeted at distributed-memory usage. If your Open MPI is not configured to use shared memory for intra-node communication, the message will be sent through the kernel's TCP stack and back to get delivered to the other process, which again adds some overhead.
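
If those per-message copies are a significant part of the cost, one thing worth trying is mpi4py's uppercase, buffer-based collectives (Scatter/Gather) on contiguous numpy arrays instead of the lowercase, pickle-based ones used on Python lists. The following is only a sketch, under the assumption that the individuals can be packed into a flat float array, which is not how the code in the question works:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

per_rank, ind_len = 4, 100          # assumed sizes, for illustration only

sendbuf = None
if rank == 0:
    # whole population packed into one contiguous float64 array
    sendbuf = np.random.rand(size * per_rank, ind_len)

recvbuf = np.empty((per_rank, ind_len), dtype='d')
comm.Scatter(sendbuf, recvbuf, root=0)      # raw buffers, no pickling

# ... evaluate the rows of recvbuf here ...

gathered = np.empty((size * per_rank, ind_len), dtype='d') if rank == 0 else None
comm.Gather(recvbuf, gathered, root=0)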

If you only intend to do computations within the same machine, there is no need to use MPI here.

Some of this is discussed in this thread.

Update: The ipc-benchmark project tries to make some sense out of how different communication types perform on different systems (multicore, multiprocessor, shared memory), and especially how this behaves on virtualized machines.

I recommend running the ipc-benchmark on the virtualized machine and posting the results. If they look anything like this benchmark, it can give you a big insight into the difference between TCP, sockets and pipes.

Candlelight answered 21/6, 2013 at 13:32 Comment(3)
I'm working on an application for testing evolutionary algorithms in different computing environments: PCs, clusters, grids, etc. I'm not arguing whether or not it's justified to use MPI here; I'm just curious where the overhead comes from. By the way, Python's Pool.map uses a queue for communication, which is implemented either with pipes or with sockets. I configured it to use sockets, so the communication method is the same. – Spaulding
@Spaulding Just because you're using sockets doesn't mean you're using TCP. Here are some interesting slides on the different speeds you might get using pipes, sockets or shmem, plus a benchmarking tool that will tell you what might be the fastest on your system (depending on how your cores can communicate, how your CPU sockets can communicate, and their NUMA domain): anil.recoil.org/talks/fosdem-io-2012.pdf – Candlelight
Thanks for pointing that out. I will check what you wrote and the website you gave when I have some time. – Spaulding
