Performance discrepancy between OSX and Linux for communication using Python multiprocessing

I have been trying to learn more about Python's multiprocessing module and to evaluate different techniques for communication between processes. I wrote a benchmark that compares the performance of Pipe, Queue, and Array (all from multiprocessing) for transferring numpy arrays between processes. The full benchmark can be found here. Here's a snippet of the test for Queue:

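import numpy as np
from multiprocessing import Process, Queue
from timeit import default_timer as timer  # assumed: the benchmark's timer


def process_with_queue(input_queue, output_queue):
    # Child: receive the array, square it, and send the result back.
    source = input_queue.get()
    dest = source**2
    output_queue.put(dest)


def test_with_queue(size):
    source = np.random.random(size)

    input_queue = Queue()
    output_queue = Queue()

    p = Process(target=process_with_queue, args=(input_queue, output_queue))
    start = timer()
    p.start()
    input_queue.put(source)
    result = output_queue.get()
    end = timer()
    p.join()  # reap the child, outside the timed region

    np.testing.assert_allclose(source**2, result)

    return end - start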

I ran this test on my Linux laptop and got the following results for an array size of 1000000:

Using mp.Array: time for 20 iters: total=2.4869s, avg=0.12435s
Using mp.Queue: time for 20 iters: total=0.6583s, avg=0.032915s
Using mp.Pipe:  time for 20 iters: total=0.63691s, avg=0.031845s

I was a little surprised to see Array perform so poorly since it uses shared memory and presumably doesn't require pickling, but I assume there must be some copying in numpy that I can't control.

However, I ran the same test (again for array size 1000000) on a Macbook, and got the following results:

Using mp.Array: time for 20 iters: total=1.6917s, avg=0.084587s
Using mp.Queue: time for 20 iters: total=2.3478s, avg=0.11739s
Using mp.Pipe:  time for 20 iters: total=8.7709s, avg=0.43855s

The absolute timing differences aren't that surprising, since different systems naturally perform differently. What is surprising is the difference in relative timing.

What could account for this? This is a pretty surprising result to me. I wouldn't be surprised to see such stark differences between Linux and Windows, or OSX and Windows, but I sort of assumed that these things would behave very similarly between OSX and Linux.

This question addresses performance differences between Windows and OSX, which seems more expected.

Reality answered 19/12, 2017 at 19:47 Comment(8)
The Value and Array types rely on a Lock to ensure data safety. Acquiring a lock is a fairly expensive operation because it requires switching to kernel mode. Serializing simple data structures, on the other hand, is what modern CPUs do most of the time, so its cost is fairly low. Removing the Lock from the Array should give better performance, but then you cannot rule out race conditions on the data. – Haplite
@Haplite if you look at the full benchmark code you'll see that I am actually not using a lock for the Array portion of the benchmark. And even then, this would only account for the poor relative performance of Array on Linux; it does not necessarily account for the discrepancy between Linux and OSX. – Reality
Does your MacBook have a solid-state drive and your Linux laptop a rotating disk? – Incident
What about other competing programs? Do you have more background threads on your Mac? What about testing with 100 or 10000 iterations? – Toothsome
@Hannu, yes you are correct. The MacBook has an SSD and my Linux laptop does not. I'm not exactly sure how this would cause the observed discrepancies, though. – Reality
@romainjouin, that's a really good point. I didn't carefully control for this, but it's safe to assume the background workloads were pretty similar. If I can find time I'll try to run the tests again with higher iteration counts and under more controlled conditions. FWIW, I also observed the same pattern of results for Linux when running under a VM. – Reality
It could explain the Array slowness on Linux. Python's shared memory implementation appears to create files on the file system (see #44747645). I would assume SSD versus rotating disk explains the difference there. It does not explain why Pipe is so slow on the Mac, though. – Incident
You should consider measuring CPU time instead of wall clock time. – Credential

TL;DR: OSX is faster with Array because calls to the C library slow Array down on Linux

multiprocessing's Array uses Python's ctypes library to make a C call that sets the memory for the array. This takes relatively more time on Linux than on OSX. You can also observe this on OSX by using pypy: setting memory takes much longer with pypy (built with GCC and LLVM) than with python3 on OSX (built with Clang).

TL;DR: the difference between Windows and OSX lies in the way multiprocessing starts new processes

The major difference is in the implementation of multiprocessing, which works differently under OSX than under Windows. The most important difference is the way multiprocessing starts a new process. There are three ways this can be done: spawn, fork, or forkserver. The default (and only supported) way under Windows is spawn. The default way under *nix (including OSX at the time of writing; since Python 3.8 the macOS default is spawn) is fork. This is documented in the Contexts and start methods section of the multiprocessing documentation.
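To see which start method applies on a given machine, something like the following sketch can be used. The helper name describe_start_methods is mine, not part of the benchmark; the multiprocessing calls are the standard ones.

```python
import multiprocessing as mp


def describe_start_methods():
    """Report the start methods this platform offers and the current default."""
    return {
        "available": mp.get_all_start_methods(),
        "default": mp.get_start_method(),
    }


if __name__ == "__main__":
    print(describe_start_methods())
    # An explicit context pins the start method regardless of the platform
    # default, which makes cross-platform comparisons fairer.
    ctx = mp.get_context("spawn")
    print(type(ctx).__name__)
```

Creating the benchmark's Process, Queue and Pipe objects from the same explicit context on every platform would remove the start method as a variable.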

One other reason for the deviation in results is the low number of iterations you take.

If you increase the number of iterations and calculate the number of handled function calls per time unit, you get relatively consistent results between the three methods.

Further analysis: look at the function calls with cProfile

I removed your timeit timer functions and wrapped your code in the cProfile profiler.

I added this wrapper function:

def run_test(iters, size, func):
    for _ in range(iters):
        func(size)

And I replaced the loop in main() with:

for func in [test_with_array, test_with_pipe, test_with_queue]:
    print(f"*** Running {func.__name__} ***")
    pr = cProfile.Profile()
    pr.enable()
    run_test(args.iters, args.size, func)
    pr.disable()
    ps = pstats.Stats(pr, stream=sys.stdout)
    ps.strip_dirs().sort_stats('cumtime').print_stats()

Analysis of the OSX - Linux difference with Array

What I see is that Queue is faster than Pipe, which is faster than Array. Regardless of the platform (OSX/Linux/Windows), Queue is between 2 and 3 times faster than Pipe. On OSX and Windows, Pipe is around 1.2 and 1.5 times faster than Array, respectively. But on Linux, Pipe is around 3.6 times faster than Array. In other words, Array is relatively much slower on Linux than on Windows and OSX. This is strange.

Using the cProfile data, I compared the performance ratio between OSX and Linux. There are two function calls that take a lot of time: Array and RawArray in sharedctypes.py. These functions are only called in the Array scenario (not in Pipe or Queue). On Linux, these calls take almost 70% of the time, while on OSX they take only 42%. So this is a major factor.
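A share like the 70% and 42% above can be read off a profile programmatically. The sketch below is not the code I used; cumulative_share, busy_work and workload are hypothetical names, and busy_work merely stands in for RawArray. It relies on the stats and total_tt attributes of pstats.Stats.

```python
import cProfile
import pstats


def cumulative_share(profile, func_name):
    """Fraction of total profiled time spent in func_name, including callees."""
    stats = pstats.Stats(profile)
    total = stats.total_tt
    if total == 0:
        return 0.0
    # stats.stats maps (filename, lineno, name) -> (cc, nc, tottime, cumtime, callers)
    cumulative = sum(
        entry[3] for (_, _, name), entry in stats.stats.items() if name == func_name
    )
    return cumulative / total


def busy_work(n):
    # Hypothetical stand-in for an expensive call such as RawArray.
    return sum(i * i for i in range(n))


def workload():
    busy_work(200_000)
    return sorted(range(50_000), reverse=True)


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

share = cumulative_share(profiler, "busy_work")
print(f"busy_work: {share:.0%} of profiled time")
```

Matching on the bare function name is a simplification; two functions with the same name in different files would be summed together.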

If we zoom in to the code, we see that Array (line 84) calls RawArray, and RawArray (line 54) does nothing special, except a call to ctypes.memset (documentation). So there we have a suspect. Let's test it.

The following code uses timeit to test the performance of setting 1 MB of memory buffer to 'A'.

import timeit
cmds = """\
import ctypes
s=ctypes.create_string_buffer(1024*1024)
ctypes.memset(ctypes.addressof(s), 65, ctypes.sizeof(s))"""
timeit.timeit(cmds, number=100000)

Running this on my MacBook Pro and on my Linux server confirms that it runs much slower on Linux than on OSX. On OSX, pypy is compiled with GCC and Apple's LLVM, which makes it more akin to the Linux world than CPython, which on OSX is compiled directly with Clang. Python programs normally run faster on pypy than on CPython, but the code above runs 6.4 times slower on pypy (on the same hardware!).
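The same idea can be applied one level up by timing RawArray itself, which allocates shared memory and then zeroes it with ctypes.memset when given an integer size. This is only a sketch: time_rawarray is a hypothetical helper, and I have not run it across the platforms discussed here.

```python
import timeit
from multiprocessing.sharedctypes import RawArray


def time_rawarray(n_items, repeats=20):
    """Average seconds to allocate (and zero) a shared array of n_items doubles."""
    total = timeit.timeit(lambda: RawArray("d", n_items), number=repeats)
    return total / repeats


if __name__ == "__main__":
    for n in (10_000, 1_000_000):
        print(f"{n:>9} doubles: {time_rawarray(n) * 1e3:.3f} ms per allocation")
```

Comparing these numbers with the bare memset timing on the same machine would indicate how much of the cost is the memset and how much is the shared memory allocation around it.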

My knowledge of C toolchains and C libraries is limited, so I can't dig deeper. So my conclusion is: OSX and Windows are faster with Array because memory calls to the C library slow Array down on Linux.

Analysis of the OSX - Windows performance difference

Next I ran this on my dual-boot MacBook Pro under OSX and under Windows. The advantage is that the underlying hardware is the same; only the OS is different. I increased the number of iterations to 1000 and the size to 10,000.

The results are as follows:

  • OSX:
    • Array: 225668 calls in 10.895 seconds
    • Pipe: 209552 calls in 6.894 seconds
    • Queue: 728173 calls in 7.892 seconds
  • Windows:
    • Array: 354076 calls in 296.050 seconds
    • Pipe: 374229 calls in 234.996 seconds
    • Queue: 903705 calls in 250.966 seconds

We can see that:

  1. The Windows implementation (using spawn) takes more calls than OSX (using fork);
  2. The Windows implementation takes much more time per call than OSX.

What's not immediately evident, but relevant to note, is that if you look at the average time per call, the relative pattern between the three multiprocessing methods (Array, Queue and Pipe) is the same (see graphs below). In other words: the differences in performance between Array, Queue and Pipe on OSX and Windows can be completely explained by two factors: 1. the difference in Python performance between the two platforms; 2. the different ways both platforms handle multiprocessing.

Put differently: the difference in the number of calls is explained by the Contexts and start methods section of the multiprocessing documentation. The difference in execution time is explained by the performance difference of Python between OSX and Windows. If you factor out those two components, the relative performance of Array, Queue and Pipe is (more or less) comparable on OSX and Windows, as is shown in the graphs below.
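The per-call averages can be recomputed from the call counts and timings listed above. The snippet is a small sketch; the function names are mine, and "calls" means the function calls counted by cProfile.

```python
# (total function calls, total seconds) as reported in the results above
results = {
    "OSX": {
        "Array": (225668, 10.895),
        "Pipe": (209552, 6.894),
        "Queue": (728173, 7.892),
    },
    "Windows": {
        "Array": (354076, 296.050),
        "Pipe": (374229, 234.996),
        "Queue": (903705, 250.966),
    },
}


def seconds_per_call(platform_results):
    """Average time per profiled function call for each method."""
    return {name: secs / calls for name, (calls, secs) in platform_results.items()}


def relative_to(per_call, baseline="Queue"):
    """Express each method's per-call time as a multiple of the baseline's."""
    base = per_call[baseline]
    return {name: value / base for name, value in per_call.items()}


for platform, data in results.items():
    per_call = seconds_per_call(data)
    ratios = relative_to(per_call)
    for name in ("Array", "Pipe", "Queue"):
        print(f"{platform:8} {name:6} {per_call[name] * 1e6:9.2f} us/call  x{ratios[name]:.2f}")
```

On both platforms the per-call ordering comes out the same: Queue is cheapest, then Pipe, then Array.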

[Figure: Performance differences of Array, Queue and Pipe between OSX and Windows]

Guess answered 11/11, 2018 at 14:47 Comment(4)
Comprehensive answer, but the question wasn't about Windows... The OP asked about the difference between Mac and Linux. – Slurry
@CoreyGoldberg: ow... darn. That's stupid... I ran it on Linux as well. Will add that in a few hours... – Guess
@CoreyGoldberg added analysis of OSX vs. Linux using Array. – Guess
@Guess thanks for the very detailed analysis. So to distill your results even further, you're saying that it basically boils down to the difference in performance of ctypes.memset on these platforms? I have no idea why that should be the case. I wonder what the relative performance of memset is in pure C code on these platforms? – Reality

Well, when we talk about multiprocessing with Python, these things happen:

  • The OS does all the multitasking work
  • It is the only option for multi-core concurrency
  • System resources are duplicated

There are huge differences between OSX and Linux: OSX is based on Unix and handles multitasking differently than Linux does.

A Unix installation requires strict, well-defined hardware and works only on specific CPUs, and OSX may not be designed to speed up Python processes. This may be the cause.

For more details you can read the MultiProcessing documentation.

I hope it helps.

Heedful answered 16/2, 2018 at 14:1 Comment(2)
I would love to learn more about which differences between OSX and Linux are having an effect here. Could you expand your answer a bit on this topic? – Narcissus
I believe that OSX and other OSes are not designed for Python. – Perfectly
