pathos: parallel processing options - Could someone explain the differences?

I am trying to run parallel processes under python (on ubuntu).

I started using multiprocessing and it worked fine for simple examples.
Then came the pickle error, and so I switched to pathos. I got a little confused with the different options and so wrote a very simple benchmarking code.

import multiprocessing as mp
from pathos.multiprocessing import Pool as Pool1
from pathos.pools import ParallelPool as Pool2
from pathos.parallel import ParallelPool as Pool3
import time

def square(x):  
    # calculate the square of the value of x
    return x*x

if __name__ == '__main__':

    dataset = range(0,10000)

    start_time = time.time()
    for d in dataset:
        square(d)
    print('test with no cores: %s seconds' %(time.time() - start_time))

    nCores = 3
    print('number of cores used: %s' %(nCores))  


    start_time = time.time()

    p = mp.Pool(nCores)
    p.map(square, dataset)

    # Close
    p.close()
    p.join()

    print('test with multiprocessing: %s seconds' %(time.time() - start_time))


    start_time = time.time()

    p = Pool1(nCores)
    p.map(square, dataset)

    # Close
    p.close()
    p.join()

    print('test with pathos multiprocessing: %s seconds' %(time.time() - start_time))


    start_time = time.time()

    p = Pool2(nCores)
    p.map(square, dataset)

    # Close
    p.close()
    p.join()

    print('test with pathos pools: %s seconds' %(time.time() - start_time))


    start_time = time.time()

    p = Pool3()
    p.ncpus = nCores
    p.map(square, dataset)

    # Close
    p.close()
    p.join()

    print('test with pathos parallel: %s seconds' %(time.time() - start_time))

I get about
- 0.001s with plain serial code, without parallel,
- 0.100s with multiprocessing option,
- 0.100s with pathos.multiprocessing,
- 4.470s with pathos.pools,
- an AssertionError error with pathos.parallel

I copied how to use these various options from http://trac.mystic.cacr.caltech.edu/project/pathos/browser/pathos/examples.html

I understand that parallel processing takes longer than plain serial code for such a simple example. What I do not understand is the relative performance of the pathos options.

I checked discussions, but could not understand why pathos.pools takes so much longer, or why I get an error ( so I am not sure what the performance of that last option would be ).

I also tried with a simple square function, and even for that pathos.multiprocessing is much slower than multiprocessing.

Could someone explain the differences between these various options?

Additionally, I ran the pathos.multiprocessing option on a remote computer running CentOS, and performance is about 10 times worse than with multiprocessing.

According to the company renting the computer, it should work just like a home computer. I understand that it may be difficult to say much without more details on the machine, but if you have any ideas as to where this could come from, that would help.

Shupe answered 26/2, 2018 at 14:24 Comment(4)
The original URL seems not to be publicly accessible any more ( 404 Not Found / Code: NoSuchKey ). For implementation details, you may review the source code or ask Mike McKerns ( he was active on StackOverflow too ).Dorri
@Dorri - We've received complaints about your use of bizarre formatting in editing other posts (use of inappropriate boldface, [SERIAL] instead of serial, etc.). I've removed that formatting here. Please don't impose your own non-standard style on other posts here.Breannebrear
@BradLarson Yes, you remove any content you decide to. Would you help in disambiguation - what formatting seems to you to be reasonably acceptable for making a difference between the (A) plain text with a word "parallel" ( used in a common, often professionally agnostic speech ) and (B) a computer science terminology term [PARALLEL], which does have a one and only one, very particular meaning, not allowing any other but this very exact C/S-context from theory of systems for one, unique, type of process scheduling? I add cross-references for this very purpose ( if you've noticed ).Dorri
@Dorri - I think it's pretty clear what parallel and serial mean in various contexts. I don't see a need to unilaterally apply formatting to posts that no one else uses. All that will do is further distract people from the content. Before taking actions on your own, perhaps you should ask the community at Meta whether they support it. If they do, I'd be happy to let it stay. At present, however, people are angry you're doing this and are complaining to moderators about it.Breannebrear

I'm the pathos author. Sorry for the confusion. You are dealing with a mix of the old and new programming interface.

The "new" (suggested) interface is to use pathos.pools. The old interface links to the same objects, so it's really two ways to get to the same thing.

multiprocess.Pool is a fork of multiprocessing.Pool, with the only difference being that multiprocessing uses pickle and multiprocess uses dill. So, I'd expect the speed to be the same in most simple cases.
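
As a quick illustrative sketch of that pickle-versus-dill difference (a minimal example, assuming both packages are installed): mapping a lambda typically fails with the stdlib pool, but works with multiprocess.

import multiprocessing
import multiprocess

if __name__ == '__main__':
    f = lambda x: x * x              # not picklable by the stdlib pickle module

    try:
        with multiprocessing.Pool(2) as p:
            print(p.map(f, range(5)))
    except Exception as e:           # usually a PicklingError
        print('multiprocessing failed:', type(e).__name__)

    with multiprocess.Pool(2) as p:  # multiprocess serializes with dill instead
        print(p.map(f, range(5)))    # [0, 1, 4, 9, 16]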

The above pool can also be found at pathos.pools._ProcessPool. pathos provides a small wrapper around several types of pools, with different backends, giving an extended functionality. The pathos-wrapped pool is pathos.pools.ProcessPool (and the old interface provides it at pathos.multiprocessing.Pool).

The preferred interface is pathos.pools.ProcessPool.
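
For reference, a minimal usage sketch of that preferred interface (assuming the nodes keyword for the worker count):

from pathos.pools import ProcessPool

def square(x):
    return x * x

if __name__ == '__main__':
    pool = ProcessPool(nodes=3)          # 3 worker processes
    print(pool.map(square, range(10)))   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    pool.close()
    pool.join()
    pool.clear()                         # destroy the cached pool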

There's also the ParallelPool, which uses a different backend -- it uses ppft instead of multiprocess. ppft is "parallel python" which spawns python processes through subprocess and passes source code (with dill.source instead of serialized objects) -- it's intended for distributed computing, or when passing by source code is a better option.

So, pathos.pools.ParallelPool is the preferred interface, and pathos.parallel.ParallelPool (and a few other similar references in pathos) are hanging around for legacy reasons -- but they are the same object underneath.

In summary:

>>> import multiprocessing as mp
>>> mp.Pool()
<multiprocessing.pool.Pool object at 0x10fa6b6d0>
>>> import multiprocess as mp
>>> mp.Pool()
<multiprocess.pool.Pool object at 0x11000c910>
>>> import pathos as pa
>>> pa.pools._ProcessPool()
<multiprocess.pool.Pool object at 0x11008b0d0>
>>> pa.multiprocessing.Pool()
<multiprocess.pool.Pool object at 0x11008bb10>
>>> pa.pools.ProcessPool()
<pool ProcessPool(ncpus=4)>
>>> pa.pools.ParallelPool()
<pool ParallelPool(ncpus=*, servers=None)>

You can see that the ParallelPool has servers... it is thus intended for distributed computing.
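
As a rough sketch of what that can look like (the server address is a placeholder, and a ppft server has to be listening on the remote host for work to actually be dispatched there):

from pathos.pools import ParallelPool

def square(x):
    return x * x

if __name__ == '__main__':
    # placeholder address; a ppft server must be running on that host:port
    pool = ParallelPool(ncpus=2, servers=('192.168.1.10:35000',))
    print(pool.map(square, range(10)))
    pool.close()
    pool.join()
    pool.clear()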

The only remaining question is: why the AssertionError? That is because the wrapper that pathos adds keeps a pool object available for reuse. Hence, when you call the ParallelPool a second time, you are calling a closed pool. You'd need to restart the pool to enable it to be used again.

>>> f = lambda x:x
>>> p = pa.pools.ParallelPool()
>>> p.map(f, [1,2,3])
[1, 2, 3]
>>> p.close()
>>> p.join()
>>> p.restart()  # throws AssertionError w/o this
>>> p.map(f, [1,2,3])
[1, 2, 3]
>>> p.close()
>>> p.join()
>>> p.clear()  # destroy the saved pool

The ProcessPool has the same interface as ParallelPool, with respect to restarting and clearing saved instances.

Sparteine answered 9/4, 2018 at 15:52 Comment(4)
Can we have multiple servers using pathos? Any examples of that?Solidago
pathos.ParallelPool is built on ppft, and uses servers the same way as ppft. Check in the pathos/examples folder, for example: github.com/uqfoundation/pathos/blob/master/examples/….Sparteine
Hello @MikeMcKerns, I have been trying to use pathos, but I am getting this TypeError: can't pickle _cffi_backend.FFI objects error. Any idea on how to solve this pickle issue?Okra
@TonyMontana: are you getting a pickle-related error with ppft (ParallelPool) or multiprocess (ProcessPool)? It should not happen with the former. There are many reasons that an object can be unpicklable; you should open a ticket on GitHub and address it there (as opposed to in the comments here).Sparteine

Could someone explain the differences?

Let's start from some common ground.

The standard Python interpreter executes code in GIL-stepped fashion. This means that all thread-based pools still wait for the GIL-stepped ordering of all code-execution paths, so any such attempt at parallelism will not enjoy the benefits theoretically expected.

The Python interpreter may, however, launch additional process-based instances, each having its own GIL lock, forming a pool of multiple, genuinely concurrent code-execution paths.
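
A minimal sketch of that difference on a CPU-bound task (timings are machine-dependent; the thread-based pool stays GIL-bound, the process-based one does not):

from multiprocessing.pool import ThreadPool
from multiprocessing import Pool
import time

def burn(n):
    # pure-python CPU-bound work, so threads contend for the GIL
    s = 0
    for i in range(n):
        s += i * i
    return s

if __name__ == '__main__':
    work = [500000] * 8
    for name, pool in (('thread-based pool ', ThreadPool(4)),
                       ('process-based pool', Pool(4))):
        t0 = time.perf_counter()
        pool.map(burn, work)
        pool.close()
        pool.join()
        print('%s: %.3f [s]' % (name, time.perf_counter() - t0))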

Having settled this principal distinction, the performance-related questions come next. The most responsible approach is to benchmark, benchmark, benchmark. No exception here.


What takes so much time here ( and where is it spent )?

The major ( constant ) part is primarily a [TIME]-domain cost of process instantiation. Here, a complete replica of the python interpreter ( all variables, all memory maps, indeed a complete, state-full copy of the calling python interpreter ) has to be created first and placed onto the operating-system process-scheduler table, before any further ( useful part of the job ) computing "inside" such a successfully instantiated sub-process can take place. If your payload function just computes x*x and returns immediately, your code has burnt all that fuel for a few CPU instructions, and you have spent way more than you received in return. The economy of costs goes against you, as the process-instantiation plus process-termination costs are way higher than a few CPU-clock ticks.

How long does this actually take?
You can benchmark this ( as proposed here, in Test-Case-A ). Once the stopwatch-measured [us] figures decide, you start to rely on facts rather than on any sort of guru-style or marketing advice. That's fair, isn't it?
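
An illustrative stopwatch sketch of that instantiation overhead (just a stand-in in the spirit of Test-Case-A, not the referenced code):

import multiprocessing as mp
import time

def square(x):
    return x * x

if __name__ == '__main__':
    t0 = time.perf_counter()
    pool = mp.Pool(3)                  # process-instantiation cost is paid here
    t1 = time.perf_counter()
    pool.map(square, range(10000))     # the actual ( tiny ) payload
    t2 = time.perf_counter()
    pool.close()
    pool.join()                        # process-termination cost is paid here
    t3 = time.perf_counter()
    print('pool spawn    : %.6f [s]' % (t1 - t0))
    print('map payload   : %.6f [s]' % (t2 - t1))
    print('pool teardown : %.6f [s]' % (t3 - t2))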


Test-Case-A benchmarks process-instantiation costs [MEASURED].
What next?

The next most dangerous ( variable in size ) part is primarily a [SPACE]-domain cost, which also has a [TIME]-domain impact once the [SPACE]-allocation costs grow beyond small footprint scales.

This sort of add-on overhead cost comes from any need to pass "large" parameters from the "main" python interpreter to each and every one of the ( distributed ) sub-process instances.

How long does this take?
Again, benchmark, benchmark, benchmark. You can benchmark this ( as proposed here, by extending the there-proposed Test-Case-C with the aNeverConsumedPAR parameter replaced by some indeed "fat" chunk of data, be it a numpy.ndarray() or any other type bearing a huge memory footprint ).

This way, the real hardware-related + O/S-related + python-related data-flow costs become visible and get measured in such a benchmark as additional overhead costs in [us]. This is nothing new to old hackers, yet people who have never seen HDD disk-write times grow to block other processing for seconds or minutes will hardly believe it without touching the real costs of data flow in their own benchmarking. So do not hesitate to extend the benchmark Test-Case-C to truly large memory footprints, to smell the smoke ...
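
An illustrative sketch of such a data-flow cost measurement (the array size and the never-consumed parameter below are placeholders in the spirit of the extended Test-Case-C, not the referenced code):

import multiprocessing as mp
import numpy as np
import time

def ignore_payload(a_never_consumed_par):
    # the parameter is never used; only its serialization / transfer cost matters
    return 0

if __name__ == '__main__':
    fat_chunk = np.zeros((2000, 2000))   # ~32 MB that has to travel with each task
    with mp.Pool(3) as pool:
        t0 = time.perf_counter()
        pool.map(ignore_payload, [fat_chunk] * 12)
        print('map with fat parameters: %.3f [s]' % (time.perf_counter() - t0))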


Last, but not least, the re-formulated Amdahl's Law will tell ...

Once an attempt to parallelise some computation is well understood, both in terms of its computing part and in terms of all its overhead part(s), the picture starts to become complete:

The overhead-strict and resources-aware Amdahl's Law re-formulation shows:

                           1                         
S =  ______________________________________________ ;  where         s,
                    /                     \                    ( 1 - s ),
                   |  ( 1 - s )            |                       pSO,
     s  + pSO + max|  _________ , atomicP  |  + pTO                pTO,
                   |      N                |                         N
                    \                     /           have been defined in
                                                      just an Overhead-strict Law
and
atomicP := a further indivisible duration of an atomic-process-block

The resulting speedup S will always suffer from the high overhead costs pSO + pTO, just as it suffers whenever N, however high, is not allowed to help further because the value of atomicP is high enough.

In all these cases the final speedup S may easily fall under 1.0 ( indeed << 1.0 ), i.e. well under a pure-[SERIAL] code-execution schedule. Again, having benchmarked the real costs of pSO and pTO ( for which Test-Case-A plus the extended Test-Case-C were schematically proposed ), there comes a chance to derive the minimum reasonable computing payload needed so as to remain above the mystic level of a Speedup >= 1.0.
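
A tiny helper sketch for plugging measured overheads into the above expression (the values in the example call are made-up placeholders, expressed as fractions of the pure-[SERIAL] run time):

def overhead_strict_speedup(s, N, pSO, pTO, atomicP=0.0):
    # s       : [SERIAL] fraction of the original workload
    # N       : number of workers available for the ( 1 - s ) part
    # pSO/pTO : setup / termination overheads, in units of the original run time
    # atomicP : further indivisible duration of an atomic process-block
    return 1.0 / (s + pSO + max((1.0 - s) / N, atomicP) + pTO)

# overheads ~100x larger than the useful work, as in the tiny benchmark above
print(overhead_strict_speedup(s=0.01, N=3, pSO=100.0, pTO=1.0))   # ~0.0099 << 1.0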

Dorri answered 26/2, 2018 at 15:15 Comment(4)
Thanks @Dorri for the general info and benchmarking tips. My question was more about the differences in performance between multiprocessing and the pathos functions.Shupe
It seems that I did not understand your point in this comment. There is no general information in the Answer, only steps for how to measure the components of the said differences in performance. So, if you run the proposed A/B testing on your platform, on your code, the quantitative results will explain to you both the scale and the origin of such differences in performance. Knowing why is always more important than just passively observing that it happened to be so slow ( while having no clue why ). Good luck with the right technology for indeed parallel processing.Dorri
Thanks, I will do that to choose the best approach for my code. My aim was to understand the intention behind having pathos.multiprocessing, pathos.pools, and pathos.parallel. These different options must have been designed with different use cases in mind? I cannot find any information on that.Shupe
Benchmarking will deliver you the answers. The setup + termination overhead-cost benchmark Test-Cases will tell you when ( not ) to use each of them. Similarly, the CPU-bound payload benchmark Test-Case-D plus some heavy disk/network-activity, IO-bound-payload-based Test-Case-X will demystify the realistic workload / performance landscape of each test-bed setup, which is what you wanted to get, right? So keep testing and you touch the very truth: no marketing, no dogmatic slogans, no skewed "best practices", just the naked facts about how things actually work, plus you learn a darn lot by doing this well.Dorri
