Your bug is unrelated to your code -- it is your use of -fext-core
to print the output of compilation in the External Core format. Just don't do that (to see the core, I use ghc-core).
Compile with -O2
and -threaded
:
$ ghc -O2 -rtsopts --make A.hs -threaded
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A ...
Then run with +RTS -N4
, for example, to use 4 threads:
$ time ./A +RTS -N4
[0.0,0.0,8.4375e-3,8.4375e-3,50.009375,50.009375,100.0,100.0]
./A 0.00s user 0.00s system 85% cpu 0.008 total
So it completes too quickly to see a result. I'll increase your m
and p
parameters to 1k and 3k
$ time ./A +RTS -N2
./A +RTS -N2 3.03s user 1.33s system 159% cpu 2.735 total
So yes, it does run in parallel. 1.6x on a 2 core machine, at a first attempt.
Whether or not it is efficient is another question. Use +RTS -s you can see the runtime stats:
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
So we had 3 threads running in parallel (2 presumably for the algo, one for the IO manager).
You can reduce running time by adjusting the GC settings. E.g. by using -A
we can reduce GC overhead, and yield genuine parallel speedups.
$ time ./A +RTS -N1 -A100M
./A +RTS -N1 -A100M 1.99s user 0.29s system 99% cpu 2.287 total
$ time ./A +RTS -N2 -A100M
./A +RTS -N2 -A100M 2.30s user 0.86s system 147% cpu 2.145 total
You can improve numeric performance sometimes by using the LLVM backend. This seems to be the case here as well:
$ ghc -O2 -rtsopts --make A.hs -threaded -fforce-recomp -fllvm
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A ...
$ time ./A +RTS -N2 -A100M
./A +RTS -N2 -A100M 2.09s user 0.95s system 147% cpu 2.065 total
Nothing spectacular, but you are improving running time over your single threaded version, and I've not modified your original code in any way. To really improve things, you'll need to profile and optimize.
Revisiting the -A flags, we can bring the time down yet further using a tigher bound on the initial thread allocation area.
$ ghc -Odph -rtsopts --make A.hs -threaded -fforce-recomp -fllvm
$ time ./A +RTS -N2 -A60M -s
./A +RTS -N2 -A60M 1.99s user 0.73s system 144% cpu 1.880 total
So have brought it down to 1.8s from 2.7 (30% improvement) by using the parallel runtime, the LLVM backend, and being careful with GC flags. You can look at the GC flag surface to find the optimum:
With the trough around -A64 -N2
being ideal for the data set size.
I would also strongly consider using manual common subexpression elimination in your inner kernel, to avoid recomputing things excessively.
As Alp suggests, to see the runtime behavior of the program, compile threadscope (from Hackage) and run as follows:
$ ghc -O2 -fllvm -rtsopts -threaded -eventlog --make A.hs
$ ./A +RTS -ls -N2 -A60M
And you get an event trace for your two cores like so:
So what's going on here? You have an initial period (0.8s) of setup time -- allocating your big list and encoding it into a repa array -- as you can see by the single threaded interleaving of GC and execution. Then there's another 0.8s of something on a single core, before your actual parallel work occurs for the last 300ms.
So while your actual algorithm may be parallelizing nicely, all the surrounding test setup basically swamps the result. If we serialize your dataset, and then just load it back from disk, we can get better behavior:
$ time ./A +RTS -N2 -A60M
./A +RTS -N2 -A60M 1.76s user 0.25s system 186% cpu 1.073 total
and now your profile looks healthier:
This looks great! Very little GC (98.9% productivity), and my two cores running in parallel happily.
So, finally, we can see you get good parallelism:
With 1 core, 1.855s
$ time ./A +RTS -N1 -A25M
./A +RTS -N1 -A25M 1.75s user 0.11s system 100% cpu 1.855 total
and with 2 cores, 1.014s
$ time ./A +RTS -N2 -A25M
./A +RTS -N2 -A25M 1.78s user 0.13s system 188% cpu 1.014 total
Now, specifically answer your questions:
- Is there a more idiomatic way of doing this in repa?
In general, repa code should consist of parallel traverals, consumers and produces, and inlinable kernel functions. So as long as you are doing that, then the code is probably idiomatic. If in doubt, look at the tutorial. I'd in general mark your worker kernels (like f
) as inlined.
- Will the above method actually price in parallel?
The code will execute in parallel if you use parallel combinators like computeP
or the various maps and folds. So yes, it should and does run in parallel.
- How can I determine whether my
code really is generating something that will be executed in
parallel?
Generally, you will know a priori because you use parallel operations. If in doubt, run the code and observe the speedup. You may then need to optimize the code.
ghc -O2
is a good start. – Jaffna