TL;DR: OSX is faster with Array because calls to the C library slow Array down on Linux
Array from multiprocessing uses the ctypes Python library to make a C call that sets the memory for the Array. This takes relatively more time on Linux than on OSX. You can also observe this on OSX by using pypy: setting memory takes much longer using pypy (built with GCC and LLVM) than using python3 on OSX (built with Clang).
TL;DR: the difference between Windows and OSX lies in the way multiprocessing starts new processes
The major difference is in the implementation of multiprocessing, which works differently under OSX than under Windows. The most important difference is the way multiprocessing starts a new process. There are three ways this can be done: using spawn, fork or forkserver. The default (and only supported) way under Windows is spawn. The default way under *nix (including OSX) is fork (for OSX this holds up to Python 3.7; from Python 3.8 on, the default on macOS is spawn). This is documented in the Contexts and start methods section of the multiprocessing documentation.
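To check which start method a given interpreter actually uses, you can query multiprocessing directly. The following is a minimal sketch that only inspects the configuration and does not start any processes; the variable names are my own.

```python
import multiprocessing as mp

# Start methods this platform supports (Windows reports only "spawn").
available = mp.get_all_start_methods()

# The start method used when nothing is configured explicitly.
default = mp.get_start_method()

# A context pins one start method without changing the global default.
# Process, Queue, Pipe and Array can all be created from it.
spawn_ctx = mp.get_context("spawn")

print(f"available: {available}")
print(f"default:   {default}")
print(f"pinned:    {spawn_ctx.get_start_method()}")
```

Creating the benchmark objects from a context such as spawn_ctx on every platform would be one way to compare operating systems with the start method held constant.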
One other reason for the deviation in results is the low number of iterations you take.
If you increase the number of iterations and calculate the number of handled function calls per time unit, you get relatively consistent results between the three methods.
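As a rough sketch of what "handled function calls per time unit" could look like in code: the helper below runs a function a fixed number of times and reports calls per second. Both calls_per_second and stand_in are hypothetical names; stand_in is only a placeholder workload, not one of the benchmark functions from the question.

```python
import time


def calls_per_second(func, size, iters):
    """Run func(size) iters times and return the achieved calls per second."""
    start = time.perf_counter()
    for _ in range(iters):
        func(size)
    elapsed = time.perf_counter() - start
    return iters / elapsed


def stand_in(size):
    # Placeholder workload (assumption): allocate a zero-filled buffer.
    return bytes(size)


# With more iterations, the reported rate should fluctuate less between runs.
for iters in (100, 1_000, 10_000):
    rate = calls_per_second(stand_in, 10_000, iters)
    print(f"{iters:6d} iterations: {rate:,.0f} calls/s")
```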
Further analysis: look at the function calls with cProfile
I removed your timeit timer functions and wrapped your code in the cProfile profiler.
I added this wrapper function:
def run_test(iters, size, func):
    for _ in range(iters):
        func(size)
And I replaced the loop in main() with:
# Requires cProfile, pstats and sys to be imported at the top of the script.
for func in [test_with_array, test_with_pipe, test_with_queue]:
    print(f"*** Running {func.__name__} ***")
    pr = cProfile.Profile()
    pr.enable()
    run_test(args.iters, args.size, func)
    pr.disable()
    ps = pstats.Stats(pr, stream=sys.stdout)
    ps.strip_dirs().sort_stats('cumtime').print_stats()
Analysis of the OSX - Linux difference with Array
What I see is that Queue is faster than Pipe, which is faster than Array. Regardless of the platform (OSX/Linux/Windows), Queue is between 2 and 3 times faster than Pipe. On OSX and Windows, Pipe is roughly 1.2 to 1.5 times faster than Array. But on Linux, Pipe is around 3.6 times faster than Array. In other words, Array is relatively much slower on Linux than on Windows and OSX. This is strange.
Using the cProfile data, I compared the performance ratio between OSX and Linux. There are two function calls that take a lot of time: Array and RawArray in sharedctypes.py. These functions are only called in the Array scenario (not in Pipe or Queue). On Linux, these calls take almost 70% of the time, while on OSX they take only 42% of the time. So this is a major factor.
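Percentages like these can be read off the printed cProfile table, but they can also be computed. Below is a minimal sketch using pstats.Stats.get_stats_profile() (available from Python 3.9); inner and outer are hypothetical stand-ins for RawArray and the benchmark function, not the real code.

```python
import cProfile
import pstats


def inner(n):
    # Stand-in for the expensive callee (assumption, not the real RawArray).
    return sum(i * i for i in range(n))


def outer(n):
    # Stand-in for the benchmark function that calls the expensive callee.
    return inner(n) + len(str(n))


profiler = cProfile.Profile()
profiler.enable()
for _ in range(200):
    outer(2_000)
profiler.disable()

# get_stats_profile() exposes the same numbers that print_stats() prints.
profile = pstats.Stats(profiler).get_stats_profile()
inner_share = profile.func_profiles["inner"].cumtime / profile.total_tt

print(f"inner accounts for {inner_share:.0%} of {profile.total_tt:.3f} s")
```

Note that func_profiles is keyed by bare function name, so functions with the same name in different modules would collide in this sketch.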
If we zoom in on the code, we see that Array (line 84) calls RawArray, and RawArray (line 54) does nothing special, except a call to ctypes.memset (documentation). So there we have a suspect. Let's test it.
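Before timing anything, here is a small sketch of the two pieces involved: a RawArray created with an integer size, which the multiprocessing documentation describes as initially zeroed, and a manual ctypes.memset on an ordinary ctypes buffer. The variable names and sizes are arbitrary choices of mine.

```python
import ctypes
from multiprocessing.sharedctypes import RawArray

# A shared-memory array of 8 C ints; with an integer size it starts zeroed.
shared = RawArray(ctypes.c_int, 8)
print(list(shared))

# The same kind of fill done by hand: set every byte of a buffer to 'A'.
buf = (ctypes.c_ubyte * 16)()
ctypes.memset(ctypes.addressof(buf), ord("A"), ctypes.sizeof(buf))
print(bytes(buf))
```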
The following code uses timeit to test the performance of setting 1 MB of memory buffer to 'A'.
import timeit
cmds = """\
import ctypes
s=ctypes.create_string_buffer(1024*1024)
ctypes.memset(ctypes.addressof(s), 65, ctypes.sizeof(s))"""
timeit.timeit(cmds, number=100000)
Running this on my MacBook Pro and on my Linux server confirms that it runs much slower on Linux than on OSX. pypy on OSX is compiled using GCC and Apple's LLVM, which makes it more akin to the Linux world than CPython, which on OSX is compiled directly against Clang. Normally, Python programs run faster on pypy than on CPython, but the code above runs 6.4 times slower on pypy (on the same hardware!).
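One caveat: the timeit snippet above also times the import and the buffer allocation on every run. A variant that moves those into timeit's setup argument, so that only the memset call is measured, could look like the sketch below. This is my own refinement under that assumption, not the measurement the numbers above are based on, and the run count is arbitrary.

```python
import timeit

# Executed once, outside the timed region: import and allocate the 1 MB buffer.
setup = """\
import ctypes
s = ctypes.create_string_buffer(1024 * 1024)
addr = ctypes.addressof(s)
size = ctypes.sizeof(s)
"""

# Executed repeatedly: fill the buffer with 'A' (byte value 65).
stmt = "ctypes.memset(addr, 65, size)"

runs = 1000
elapsed = timeit.timeit(stmt, setup=setup, number=runs)
per_call_us = elapsed / runs * 1e6

print(f"{runs} memset calls: {elapsed:.4f} s ({per_call_us:.1f} us per call)")
```

Comparing this number between platforms would show whether the memset call itself or the allocation around it dominates.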
My knowledge of C toolchains and C libraries is limited, so I can't dig deeper. So my conclusion is: OSX and Windows are faster with Array because memory calls to the C library slow Array down on Linux.
Analysis of the OSX - Windows performance difference
Next I ran this on my dual-boot MacBook Pro under OSX and under Windows. The advantage is that the underlying hardware is the same; only the OS is different. I increased the number of iterations to 1000 and the size to 10,000.
The results are as follows:
- OSX:
  - Array: 225668 calls in 10.895 seconds
  - Pipe: 209552 calls in 6.894 seconds
  - Queue: 728173 calls in 7.892 seconds
- Windows:
  - Array: 354076 calls in 296.050 seconds
  - Pipe: 374229 calls in 234.996 seconds
  - Queue: 903705 calls in 250.966 seconds
We can see that:
- The Windows implementation (using spawn) takes more calls than OSX (using fork);
- The Windows implementation takes much more time per call than OSX.
What's not immediately evident, but relevant to note, is that if you look at the average time per call, the relative pattern between the three multiprocessing methods (Array, Queue and Pipe) is the same (see graphs below). The differences in performance between Array, Queue and Pipe on OSX and Windows can therefore be completely explained by two factors: 1. the difference in Python performance between the two platforms; 2. the different ways both platforms handle multiprocessing.
In other words: the difference in the number of calls is explained by the Contexts and start methods section of the multiprocessing documentation. The difference in execution time is explained by the performance difference of Python between OSX and Windows. If you factor out those two components, the relative performance of Array, Queue and Pipe is (more or less) comparable on OSX and Windows, as is shown in the graphs below.
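The average time per call can be recomputed from the cProfile totals listed above. The sketch below does that arithmetic and normalises each platform to its fastest method. The numbers are copied from the results list; "calls" here means function calls counted by cProfile, not benchmark iterations, and the helper name per_call_us is my own.

```python
# (function calls, total seconds) as reported by cProfile per platform and method
results = {
    "OSX": {
        "Array": (225668, 10.895),
        "Pipe": (209552, 6.894),
        "Queue": (728173, 7.892),
    },
    "Windows": {
        "Array": (354076, 296.050),
        "Pipe": (374229, 234.996),
        "Queue": (903705, 250.966),
    },
}


def per_call_us(calls, seconds):
    """Average time per profiled function call, in microseconds."""
    return seconds / calls * 1e6


normalised = {}
for platform, rows in results.items():
    times = {name: per_call_us(*row) for name, row in rows.items()}
    fastest = min(times.values())
    # Express each method relative to the fastest one on the same platform.
    normalised[platform] = {name: t / fastest for name, t in times.items()}
    for name, t in times.items():
        print(f"{platform:8} {name:6} {t:8.1f} us/call  {t / fastest:5.2f}x")
```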
Value and Array types rely on a Lock to ensure data safety. Acquiring a lock is a fairly expensive action as it requires to switch to kernel mode. On the other hand, serializing simple data structures is what modern CPUs do most of the time so its cost is fairly low. Removing the Lock from the Array should show better performance but you cannot exclude race conditions over the data. – Haplite
[...] Array portion of the benchmark. And even then this would only account for the poor relative performance of Array on Linux, but it does not necessarily account for the discrepancy between Linux and OSX. – Reality