I know that questions about multi-threading performance in Julia have already been asked (e.g. here), but they involve fairly complex code in which many things could be at play.
Here, I am running a very simple loop on multiple threads using Julia v1.5.3, and the speedup scales poorly compared to running the same loop with, for instance, Chapel.
I would like to know what I am doing wrong and how I could run multi-threading in Julia more efficiently.
Sequential code
    using BenchmarkTools

    function slow(n::Int, digits::String)
        total = 0.0
        for i in 1:n
            if !occursin(digits, string(i))
                total += 1.0 / i
            end
        end
        println("total = ", total)
    end

    @btime slow(Int64(1e8), "9")
Time: 8.034s
Shared memory parallelism with Threads.@threads on 4 threads
    using BenchmarkTools
    using Base.Threads

    function slow(n::Int, digits::String)
        # Shared accumulator; every update goes through an atomic operation
        total = Atomic{Float64}(0.0)
        @threads for i in 1:n
            if !occursin(digits, string(i))
                atomic_add!(total, 1.0 / i)
            end
        end
        # total[] reads the value stored in the Atomic
        println("total = ", total[])
    end

    @btime slow(Int64(1e8), "9")
Time: 6.938s
Speedup: 1.2
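Since the heading above only states "4 threads", it may be worth confirming the thread count from inside the session. This is a minimal sketch, assuming Julia was started with `julia -t 4` or with `JULIA_NUM_THREADS=4` (the post does not say which); `threads_match` is a hypothetical helper, not part of any library.

```julia
using Base.Threads

# Hypothetical helper: true if the session runs with the expected number of threads.
threads_match(expected::Int) = nthreads() == expected

println("nthreads() = ", nthreads())
println("running on 4 threads? ", threads_match(4))
```

If this reports a single thread, the `@threads` loop runs sequentially and no speedup can be expected.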
Shared memory parallelism with FLoops on 4 threads
    using BenchmarkTools
    using FLoops

    function slow(n::Int, digits::String)
        total = 0.0
        @floop for i in 1:n
            if !occursin(digits, string(i))
                @reduce(total += 1.0 / i)
            end
        end
        println("total = ", total)
    end

    @btime slow(Int64(1e8), "9")
Time: 10.850s
No speedup: slower than the sequential code.
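For clarity, the speedups quoted here are taken to be the sequential time divided by the parallel time, which is an assumption about how they were computed. A small sketch using the timings reported above (`speedup` is an illustrative helper name):

```julia
# Speedup = sequential time / parallel time; values below 1 mean the parallel run is slower.
speedup(t_seq::Real, t_par::Real) = t_seq / t_par

t_seq     = 8.034    # sequential run (s)
t_threads = 6.938    # Threads.@threads on 4 threads (s)
t_floop   = 10.850   # FLoops on 4 threads (s)

println("Threads.@threads: ", round(speedup(t_seq, t_threads), digits = 1))  # prints 1.2
println("FLoops:           ", round(speedup(t_seq, t_floop), digits = 1))    # prints 0.7
```

With perfect scaling, both values would be close to 4 on 4 threads.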
Tests on various numbers of threads (different hardware)
I tested the sequential and Threads.@threads code on a different machine and experimented with various numbers of threads.
Here are the results:
Number of threads | Speedup |
---|---|
2 | 1.2 |
4 | 1.2 |
8 | 1.0 (no speedup) |
16 | 0.9 (the code takes longer to run than the sequential code) |
For heavier computations (n = 1e9 in the code above), which should minimize the relative effect of any fixed overhead, the results are very similar:
Number of threads | Speedup |
---|---|
2 | 1.1 |
4 | 1.3 |
8 | 1.1 |
16 | 0.8 (the code takes longer to run than the sequential code) |
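To separate the cost of the shared atomic update from other overhead, one variant that could be timed alongside the versions above keeps a private partial sum per chunk and combines the results at the end. This is only a sketch under stated assumptions: `slow_chunked` and its `nchunks` keyword are hypothetical names, the range is split into equal contiguous chunks, and no timings are claimed for it.

```julia
using Base.Threads

# Hypothetical variant of `slow`: one partial sum per chunk instead of a shared Atomic.
function slow_chunked(n::Int, digits::String; nchunks::Int = nthreads())
    chunk_len = cld(n, nchunks)           # ceiling division: size of each chunk
    partials = zeros(Float64, nchunks)    # one slot per chunk, written once
    @threads for c in 1:nchunks
        lo = (c - 1) * chunk_len + 1
        hi = min(c * chunk_len, n)
        acc = 0.0                         # local accumulator, no synchronization needed
        for i in lo:hi                    # empty range if lo > hi
            if !occursin(digits, string(i))
                acc += 1.0 / i
            end
        end
        partials[c] = acc
    end
    return sum(partials)
end

println("total = ", slow_chunked(1_000_000, "9"))
```

Slots are indexed by chunk rather than by `threadid()`, so the result does not depend on how iterations are scheduled. The total may differ from the sequential one in the last few digits because floating-point additions are performed in a different order.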
For comparison: same loop with Chapel showing near-perfect scaling
Code run with Chapel v1.23.0:
    use Time;
    var watch: Timer;
    config const n = 1e8: int;
    config const digits = "9";
    var total = 0.0;

    watch.start();
    forall i in 1..n with (+ reduce total) {
      if (i: string).find(digits) == -1 then
        total += 1.0 / i;
    }
    watch.stop();

    writef("total = %{###.###############} in %{##.##} seconds\n",
           total, watch.elapsed());
First run (same hardware as the first Julia tests):
Number of threads | Time (s) | Speedup |
---|---|---|
1 | 13.33 | n/a |
2 | 7.34 | 1.8 |
Second run (same hardware):
Number of threads | Time (s) | Speedup |
---|---|---|
1 | 13.59 | n/a |
2 | 6.83 | 2.0 |
Third run (different hardware):
Number of threads | Time (s) | Speedup |
---|---|---|
1 | 19.99 | n/a |
2 | 10.06 | 2.0 |
4 | 5.05 | 4.0 |
8 | 2.54 | 7.9 |
16 | 1.28 | 15.6 |