How to make multi-threading in Julia scale with the number of threads?

I know that questions about multi-threading performance in Julia have already been asked (e.g. here), but they involve fairly complex code in which many things could be at play.

Here, I am running a very simple loop on multiple threads using Julia v1.5.3, and the speedup doesn't scale well compared to running the same loop in, for instance, Chapel.

I would like to know what I am doing wrong and how I could run multi-threading in Julia more efficiently.

Sequential code

using BenchmarkTools

function slow(n::Int, digits::String)
    total = 0.0
    for i in 1:n
        if !occursin(digits, string(i))
            total += 1.0 / i
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")

Time: 8.034s

Shared memory parallelism with Threads.@threads on 4 threads

using BenchmarkTools
using Base.Threads

function slow(n::Int, digits::String)
    total = Atomic{Float64}(0)
    @threads for i in 1:n
        if !occursin(digits, string(i))
            atomic_add!(total, 1.0 / i)
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")

Time: 6.938s
Speedup: 1.2

Shared memory parallelism with FLoops on 4 threads

using BenchmarkTools
using FLoops

function slow(n::Int, digits::String)
    total = 0.0
    @floop for i in 1:n
        if !occursin(digits, string(i))
            @reduce(total += 1.0 / i)
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")

Time: 10.850s
No speedup: slower than the sequential code.

Tests on various numbers of threads (different hardware)

I tested the sequential and Threads.@threads code on a different machine and experimented with various numbers of threads.

Here are the results:

Number of threads    Speedup
2                    1.2
4                    1.2
8                    1.0 (no speedup)
16                   0.9 (the code takes longer to run than the sequential code)

For heavier computations (n = 1e9 in the code above), which should minimize the relative effect of any overhead, the results are very similar:

Number of threads    Speedup
2                    1.1
4                    1.3
8                    1.1
16                   0.8 (the code takes longer to run than the sequential code)
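
For reference, the number of threads in these tests was set when launching Julia, e.g.:

JULIA_NUM_THREADS=16 julia script.jl   # via the environment variable
julia -t 16 script.jl                  # or via the -t flag (Julia 1.5+)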

For comparison: same loop with Chapel showing perfect scaling

Code run with Chapel v1.23.0:

use Time;
var watch: Timer;
config const n = 1e8: int;
config const digits = "9";
var total = 0.0;
watch.start();
forall i in 1..n with (+ reduce total) {
  if (i: string).find(digits) == -1 then
    total += 1.0 / i;
}
watch.stop();
writef("total = %{###.###############} in %{##.##} seconds\n",
        total, watch.elapsed());

First run (same hardware as the first Julia tests):

Number of threads    Time (s)    Speedup
1                    13.33       n/a
2                    7.34        1.8

Second run (same hardware):

Number of threads    Time (s)    Speedup
1                    13.59       n/a
2                    6.83        2.0

Third run (different hardware):

Number of threads    Time (s)    Speedup
1                    19.99       n/a
2                    10.06       2.0
4                    5.05        4.0
8                    2.54        7.9
16                   1.28        15.6
Inveterate asked 24/2, 2021 at 2:13. Comments (4):
Manslayer: Also, is the equivalent Chapel code thread-safe? (I don't know Chapel well.)

Inveterate: Yes, it is thread-safe.

Vanzandt: Not directly related to the question, but while doing my own timings, I converted the timed computation in Chapel to the more compact/elegant: total = + reduce [i in 1..n] if (i: string).find(digits) == -1 then 1.0 / i else 0.0;

Inveterate: Thank you Brad for the neat compact version! I will leave my less elegant version in the question as it may be easier to read for non-Chapel experts.
Answer by Pender (score 6)

Someone can make a much more detailed analysis than me, but the main reason naive Julia threading performs badly here is that the "task" in each iteration is far too light. Using an atomic accumulator in this case implies huge overhead, because all the threads spend most of their time waiting to perform the atomic update.
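
To make the contrast concrete, here is a minimal sketch (mine, not from the question) that avoids atomics entirely: spawn one task per chunk of the range, accumulate a task-local partial sum, and combine the partial sums once at the end:

using Base.Threads

function slow_chunked(n::Int, digits::String)
    # One contiguous chunk of the range per thread
    chunks = Iterators.partition(1:n, cld(n, nthreads()))
    tasks = map(chunks) do chunk
        @spawn begin
            s = 0.0                    # task-local accumulator: no atomics, no contention
            for i in chunk
                if !occursin(digits, string(i))
                    s += 1.0 / i
                end
            end
            s
        end
    end
    total = sum(fetch.(tasks))         # combine the partial sums once
    "total = $total"
end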

Since your Chapel code is doing a mapreduce, we can also try a parallel mapreduce in Julia:


julia> function slow(n::Int, digits::String)
           total = 0.0
           for i in 1:n
               if !occursin(digits, string(i))
                   total += 1.0 / i
               end
           end
           "total = $total"
       end
slow (generic function with 1 method)

julia> @btime slow(Int64(1e5), "9")
  6.021 ms (200006 allocations: 9.16 MiB)
"total = 9.692877792106202"

julia> using ThreadsX

julia> function slow_thread_thx(n::Int, digits::String)
           total = ThreadsX.mapreduce(+,1:n) do i
               if !occursin(digits, string(i))
                   1.0 / i
               else
                   0.0
               end
           end
           "total = $total"
       end

julia> @btime slow_thread_thx(Int64(1e5), "9")
  1.715 ms (200295 allocations: 9.17 MiB)
"total = 9.692877792106195"

This was run with 4 threads; I've tested with other numbers of threads and confirmed the scaling is pretty linear.

Btw, just as a general tip: avoid printing inside benchmarked code, because it makes a mess when the code is timed repeatedly, and if your task is fast, stdio can take non-negligible time.
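
For example, you can have the benchmarked call return the string and print it once afterwards; @btime returns the value of the benchmarked expression:

julia> result = @btime slow_thread_thx(Int64(1e5), "9");
  1.715 ms (200295 allocations: 9.17 MiB)

julia> println(result)
total = 9.692877792106195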

Pender answered 25/2, 2021 at 5:10. Comments (12):
Inveterate: Thank you for the info about printing. Does this apply to the use of println() only, or does it also apply to printing the way you did?

Inveterate: About printing the way you did: this works in the REPL, but you get no output when running the code with julia script.jl (the only output is the time), so it is not very convenient for checking that the result is correct.

Inveterate: I ran your code and, while the speedup is definitely better than with @threads (so thanks for the suggestion!), it still doesn't scale up much: for 2, 4, 8, and 16 threads I get speedups of 1.4, 1.7, 2.1, and 1.6.

Inveterate: You get reasonable scaling because your problem size is very small (n = 1e5) and dominated by other things. If you increase the calculation to n = 1e8 as in my example, the scaling falls apart.

Inveterate: I thought that the explanation at the top of your answer was very interesting, so I tested it with a large computation repeated a few times (instead of a small one repeated many times). It should scale well if that explanation were correct, but it doesn't scale at all either.

Pender: If you look at the memory allocation figures, by the time we're running with 1e8 the memory becomes a limiting factor too (9 GiB allocated per run, due to the strings), and I think in this case GC is working against us.

Inveterate: Memory is not a limiting factor when I run it on my machine, as I have 32 GiB. I am also not sure what the memory allocation reported by BenchmarkTools really is: when I monitor htop while running the code, my memory usage doesn't even increase in a visible manner. When I run the code on 16 threads I am maxing out the CPUs, but memory is so little affected that I can't even detect the change.

Pender: The speed of memory, not the amount of memory, especially once you have enough threads all doing memory-intense operations. The CPU usage % can be misleading due to how it's defined.

Inveterate: I need to do some research on that; I don't know what memory speed is, so I will look into it. That said, Chapel and other languages don't seem to have that limitation, since they scale perfectly on the same hardware with the same values. So I am trying to get better results with Julia, and it's been difficult so far. Scaling down the problem due to memory difficulties is not exactly a viable option in real situations: it's precisely because the computation is large that I want to run it on several threads.

Pender: The computation being large is not a problem as long as the "largeness" is NOT primarily due to memory I/O on string operations; Julia's strings aren't the fastest. If this example truly represents your real workload, you would have a better experience using Distributed, since in that case you get parallel GC for free.

Inveterate: Yes, using Distributed scales beautifully up to 32 cores. After that, we ran into out-of-memory problems.

Inveterate: But we really wanted to explore the multi-threading capabilities of Julia here (our real-world use case is that we teach many tools in the context of HPC and are preparing teaching material in Julia). So we are trying to understand what scales, what doesn't, and why. (And how to write proper Julia code too lol.)
Answer by Countersink (score 4)

As jling suggests in the comments on their answer, the problem here is most likely that the Julia code allocates a lot of memory that then has to be garbage collected. Chapel is, to my understanding, not a garbage-collected language, and that could explain why this example scales more linearly. As a small test of this hypothesis, I benchmarked the following code, which performs the same operations but with a preallocated Vector{UInt8} instead of String:

using BenchmarkTools
using Transducers

# Write the decimal digits of x into the start of the preallocated buffer a
function string_vector!(a::Vector{UInt8}, x::Unsigned)
    n = ndigits(x)
    length(a) < n && error("Vector too short")
    i = n
    @inbounds while i >= 1
        d, r = divrem(x, 0x0a)   # peel off the least-significant decimal digit
        a[i] = 0x30 + r          # 0x30 is ASCII '0'
        x = oftype(x, d)
        i -= 1
    end
    a
end

function slow_no_garbage(n::UInt, digits::String)
    digits = collect(codeunits(digits))
    thread_strings = [zeros(UInt8, 100) for _ in 1:Threads.nthreads()]
    # Base._searchindex is an internal helper; it returns 0 when `digits` is not found
    fun(i) = if Base._searchindex(string_vector!(thread_strings[Threads.threadid()], i), digits, 1) == 0
            1.0 / i
        else
            0.0
        end
    total = foldxt(+, Map(fun), 0x1:n)
    "total = $total"
end
println(@btime slow_no_garbage(UInt(1e8), "9"))

I do not recommend using this code: because the numbers in this workload always grow in length, I get away with never properly clearing the thread buffers between iterations, although that is easily fixed (see the sketch below). However, it results in almost linear scaling with the number of threads (tables at the end of the answer).
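
As a sketch of that easy fix (a hypothetical helper, not used in the benchmarks below): zero the buffer before each reuse, so stale digits from earlier calls can never be matched:

function string_vector_clear!(a::Vector{UInt8}, x::Unsigned)
    fill!(a, 0x00)          # wipe leftover digits from previous iterations
    string_vector!(a, x)    # then write the digits of x as before
end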

As jling also mentioned, if a lot of garbage is created, distributing the work over processes may be better than threading. The following two code snippets use Transducers.jl to run the code, first using threads:

using BenchmarkTools
using Transducers

function slow_transducers(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
            1.0 / i
        else
            0.0
        end
    total = foldxt(+, Map(fun), 1:n)
    "total = $total"
end

println(@btime slow_transducers(Int64(1e8), "9"))

and then distributed to separate Julia processes (taking the number of processes as the first command-line argument):

using BenchmarkTools
using Transducers
using Distributed

function slow_distributed(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
            1.0 / i
        else
            0.0
        end
    total = foldxd(+, Map(fun), 1:n)
   "total = $total"
end

addprocs(parse(Int, ARGS[1]))
println(@btime slow_distributed(Int64(1e8), "9"))

The following table shows the wall-clock times for all versions with different numbers of threads/processes:

Threads/processes   jling     slow_transducers   slow_distributed   slow_no_garbage   Chapel
1                   4.242 s   4.224 s            4.241 s            2.743 s           7.32 s
2                   2.952 s   2.958 s            2.168 s            1.447 s           3.73 s
4                   2.135 s   2.147 s            1.163 s            0.716105 s        1.9 s
8                   1.742 s   1.741 s            0.859058 s         0.360469 s

Speedup:

Threads/processes   jling     slow_transducers   slow_distributed   slow_no_garbage   Chapel
1                   1.0       1.0                1.0                1.0               1.0
2                   1.43699   1.42799            1.95618            1.89565          1.96247
4                   1.98689   1.9674             3.6466             3.83044          3.85263
8                   2.43513   2.42619            4.9368             7.60953
Countersink answered 28/2, 2021 at 13:47. Comments (6):
Inveterate: Thanks for providing more material for thought. Note that I am not interested in Distributed here, as I am trying to explore multi-threading in Julia, not just speed up code. Maybe I will change the title of my question to make this clearer.

Inveterate: This would help explain why some of our tests were scaling well and others not at all, so this was very useful. Thank you. We will do more testing exploring this.

Inveterate: Well, we ran some loops without any string conversion and they weren't scaling any better.

Countersink: That's interesting. Could you share some example of such code, add it to the original post, or share a link to it? Is it allocating a lot of memory? Could it be fixed by preallocation like slow_no_garbage?

Inveterate: Sorry for my slow reply. Yes, I already thought that I should edit my question with an example not involving string conversion, since everybody is focusing on this. I will look at your solution more closely and test it on other loops, then I'll edit the question. Thanks for the suggestion.

Inveterate: Sorry for being so slow in following up on all this. I'll come back to this when I find the time, but for now I just wanted to thank you for introducing me to Transducers.jl in your example.
Answer by Churn (score 2)

As pointed out in the previous answers, I also found that the performance of multi-threading in Julia is largely influenced by garbage collection.

I used a simple trick: adding GC.gc() before the multi-threaded task to "clean up" the previous garbage. Note that this only helps when the memory allocation is not too large.

BTW, you can use GC.enable_logging(true) (available since Julia 1.8) to get an idea of how long GC takes (it is huge in my code!).
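
A minimal sketch of both tricks together, assuming Julia 1.8+ for the logging and reusing the threaded slow from the question:

GC.enable_logging(true)       # log each collection with its pause time (Julia 1.8+)
GC.gc()                       # full collection, so the timed region starts with a clean slate
@time slow(Int64(1e8), "9")
GC.enable_logging(false)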

Churn answered 17/12, 2022 at 3:26.
