Julia Multithreading is slowing down when the number of threads is high

I am not very experienced in parallel programming, but I encountered an interesting situation when trying to run my Julia code in parallel.

using Base.Threads   # for @threads

@threads for i in 1:THREADS
    run(parameters[i], tm)
end

parameters::Vector{Parameters} is a vector of mutable structs and tm is the termination time for each thread. There is no atomic variable. The average number of iterations for various values of the THREADS variable is as follows:

THREADS  Iterations
1         35087
2         44079
3         50220
4         43701
5         39624
6         38986
7         34625
8         35810
9         29248
10        28075
11        20376
12        27342

The highest number of iterations is observed when THREADS=3. My CPU is an Apple M2 Pro 12-core 3480 MHz. I must note that the run(., .) function contains a genetic algorithm procedure and it creates and discards a great many objects in memory, so I suspect the slowdown is caused by Julia's garbage collection system. If you have any ideas about why it peaks at THREADS=3, I'd love to know. Thank you.
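A rough sketch of how the GC share of the wall time could be checked (using Base's @timed macro, which reports gctime; the printed numbers are whatever the run produces, nothing below is a measured result):

# Measure how much of the threaded loop's wall time is spent in GC.
stats = @timed @threads for i in 1:THREADS
    run(parameters[i], tm)
end
println("total: ", stats.time, " s, gc: ", stats.gctime, " s (",
        round(100 * stats.gctime / stats.time; digits=1), " %)")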

Fullmouthed answered 26/12, 2023 at 15:58 Comment(4)
What is the code of run? There is no reason to assume a function scales well in such a case. The GC is indeed often the culprit. It can also be the memory (which does not scale), the CPU (power limits, frequency scaling, etc.) or the OS scheduling. Julia generally prints (an estimate of) the time taken by the GC (though it is not always accurate). Reducing allocations is the key to reducing the overhead of the GC. – Mariomariology
Thank you for your time. The code for run consists of many lines and sub-functions, but in each iteration of run, many mutable structs are created. Could the threads be blocked by the memory allocation process? – Fullmouthed
This is unlikely but possible. One would certainly need to look at Julia's code to be sure. You can find some information about Julia's GC here. The GC is itself parallel, but note that doing many allocations makes the data structures slower for the GC to track. More allocations generally mean more (partial) GC collections and thus a higher overhead. Since AFAIK nearly all GCs tend not to scale (in fact the RAM can be saturated with mark & sweep GCs), I expect this to be a bottleneck with many cores. – Mariomariology
Note that since Julia uses a generational GC, the way you allocate objects and keep references impacts the performance of the GC steps. For example, short-lived objects that do not escape the loop they are created in can be collected faster than others. Such details matter for performance. If you can, try to preallocate (and recycle) the data structures, and use plain arrays as much as possible to avoid allocations in the first place. – Mariomariology
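A minimal sketch of the preallocate-and-recycle idea from the last comment (the names score, population, evaluate_naive and evaluate! are made up for illustration, not taken from the question's code):

# Allocating a fresh array inside the hot loop stresses the GC:
function evaluate_naive(population)
    return [score(ind) for ind in population]    # new Vector every generation
end

# Preallocating a buffer once and overwriting it in place avoids those allocations:
function evaluate!(scores::Vector{Float64}, population)
    for (i, ind) in enumerate(population)
        scores[i] = score(ind)                   # score(ind) is a placeholder fitness function
    end
    return scores
end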

Unfortunately, Julia's garbage collector is not efficient when there are many threads. As far as I know this is being improved from version to version, and e.g. moving from 1.9 to the just-released 1.10 could lead to some performance gain.

As a rule of thumb, for massively parallel code that allocates a lot, the best approach is multiprocessing instead of multithreading. You should observe a significant performance gain:

using Distributed
addprocs(12)                    # spawn 12 worker processes
@everywhere using MyPackage     # make run() and Parameters available on every worker

@sync @distributed for i in 1:nworkers()
    run(parameters[i], tm)
end

Or, depending on the result aggregation pattern:

@everywhere using MyPackage, DataFrames

results = @distributed (append!) for i in 1:nworkers()
    res = run(parameters[i], tm)
    DataFrame(; res)            # wrap each worker's result in a DataFrame; append! merges them
end
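If each call simply returns its result, another option is pmap, which collects the return values into a Vector in order (a sketch, assuming run and its dependencies are loaded on the workers with @everywhere as above):

# One task per Parameters value, executed on the worker processes.
results = pmap(p -> run(p, tm), parameters)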
Warnke answered 27/12, 2023 at 22:17 Comment(0)
