Problem
Hello, I'm using the accelerate library to create an application that lets the user interactively call functions that process images, which is why I'm basing my work on GHCi and extending it using the GHC API.
The problem is that when running the compiled executable from the shell, the computations finish in under 100 ms (slightly less than 80 ms), while running the same compiled code within GHCi takes over 100 ms (on average a bit more than 140 ms) to finish.
Resources
sample code + execution logs: https://gist.github.com/zgredzik/15a437c87d3d8d03b8fc
Description
First of all: the tests were run after the CUDA kernel had been compiled (the compilation itself adds an additional 2 seconds, but that's not the issue here).
When running the compiled executable from the shell, the computations finish in under 10 ms (the shell first run and the second shell run in the logs were given different arguments to make sure the data wasn't cached anywhere).
When trying to run the same code from GHCi and fiddling with the input data, the computations take over 100 ms. I understand that interpreted code is slower than compiled code, but I'm loading the same compiled code within the GHCi session and calling the same top-level binding (packedFunction). I've explicitly typed it to make sure it is specialized (with the same results as using the SPECIALIZE pragma). However, the computations do take less than 10 ms if I run the main function in GHCi (even when changing the input data with :set args between consecutive calls).
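For illustration, the two specialization approaches described above can be sketched in plain Haskell. The names here (process, the list-based computation) are mine, standing in for the question's real accelerate pipeline; only the pattern of an explicit monomorphic signature plus a SPECIALISE pragma is taken from the question:

```haskell
module Main where

-- Hypothetical polymorphic processing step, standing in for the
-- accelerate pipeline in the real application.
process :: Num a => [a] -> a
process = sum . map (+ 1)
{-# SPECIALISE process :: [Double] -> Double #-}

-- An explicitly typed wrapper, analogous to the question's
-- packedFunction: the monomorphic signature pins down the Num
-- dictionary at compile time instead of passing it at runtime.
packedFunction :: [Double] -> Double
packedFunction = process

main :: IO ()
main = print (packedFunction [1 .. 10])  -- prints 65.0
```

Both mechanisms aim at the same thing: eliminating dictionary passing for the hot path. As the question notes, neither changed the observed GHCi timings.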
I compiled Main.hs with:
ghc -o main Main.hs -O2 -dynamic -threaded
I'm wondering where the overhead comes from. Does anyone have any suggestions as to why this is happening?
A simplified version of the example posted by remdezx:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA as C
import Data.Time.Clock (diffUTCTime, getCurrentTime)
main :: IO ()
main = do
    start <- getCurrentTime
    print $ C.run $ A.maximum $ A.map (+1) $ A.use (fromList (Z:.1000000) [1..1000000] :: Vector Double)
    end <- getCurrentTime
    print $ diffUTCTime end start
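A side note on measurement: in the snippet above the timed region also includes building the million-element input list. A general benchmarking-hygiene sketch (my addition, not code from the question; it times a pure stand-in computation rather than C.run, which is unavailable without a CUDA setup) is to fully force the input before starting the clock and force the result inside the timed region:

```haskell
module Main where

import Control.DeepSeq   (force)
import Control.Exception (evaluate)
import Data.Time.Clock   (diffUTCTime, getCurrentTime)

main :: IO ()
main = do
    -- Build and fully force the input before starting the clock, so
    -- lazy list construction is not charged to the timed region.
    xs <- evaluate (force [1 .. 1000000 :: Double])
    start <- getCurrentTime
    -- Force the result inside the timed region (a pure stand-in for
    -- C.run; accelerate's run already returns an evaluated array).
    r <- evaluate (sum xs)
    end <- getCurrentTime
    print r
    print (diffUTCTime end start)
```

This does not explain the executable-vs-GHCi gap, but it narrows what the reported numbers actually measure.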
When I compile and execute it, it takes 0.09 s to finish:
$ ghc -O2 Main.hs -o main -threaded
[1 of 1] Compiling Main ( Main.hs, Main.o )
Linking main ...
$ ./main
Array (Z) [1000001.0]
0.092906s
But when I precompile it and run it in the interpreter, it takes 0.25 s:
$ ghc -O2 Main.hs -c -dynamic
$ ghci Main
ghci> main
Array (Z) [1000001.0]
0.258224s
Comments

Data.Array.Accelerate.CUDA.run: I've noticed that when the accelerate library is loaded into GHCi, run executes 3 times slower than when it's used in an executable. I tried adding the following pragmas, but they had no effect:
{-# SPECIALISE run :: Acc (Array DIM2 Double) -> Array DIM2 Double #-}
{-# SPECIALISE run :: Acc (Array DIM2 Float) -> Array DIM2 Float #-}
Can we somehow optimize this run function for GHCi? – Pyrophoric

$ ghc -O2 Test.hs && ghci Test results in GHCi recompiling Test in interpreted mode because the flags changed (no -O2 in the second invocation). I don't know if that's relevant to accelerate. I can't test this example (I don't have a CUDA system handy), so I'm hesitant to post an answer. – Intoxication

…the -dynamic flag seemed to have solved the problem. This may be indicated by: a) the part in the logs from GHCi where > :l Main results in Ok, modules loaded: Main. (instead of [1 of 1] Compiling Main ( Main.hs, interpreted )); b) running :show modules results in Main ( Main.hs, Main.o ) (should've had that in the logs too, I guess). – Veritable

…-fobject-code and -O2: this also makes GHCi load precompiled code (we can see it by running :show modules), but there is still no speedup. Something still remains unoptimized. – Pyrophoric

Ok, modules loaded: Main. just indicates that GHCi didn't recompile or reinterpret the module. It's conceivable that the module it loads is compiled but not optimized; in my testing, I found several corner cases where it was not obvious whether the loaded object code was optimized. If you want to be completely sure, you can use the example code from the other topic, which proves whether the rewrite rule (and presumably other optimizations) worked. That said, it would make sense if something other than GHC optimizations makes a difference here. – Intoxication

…isOptimised = True, but the code is still slow. – Pyrophoric

…accelerate a few days ago, and it didn't look like it was designed to rely heavily on GHC optimizations. – Intoxication
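The isOptimised = True check referred to in the comments is presumably the rewrite-rule probe trick. A minimal reconstruction (my sketch, not the code from the linked topic): a RULES pragma only fires when GHC's optimizer runs, so the program's output reveals whether the code being executed was compiled with -O:

```haskell
module Main where

-- Returns False unless the rewrite rule below replaces the call.
isOptimised :: () -> Bool
isOptimised _ = False
{-# NOINLINE isOptimised #-}

-- Rewrite rules fire only when the simplifier runs (i.e. with -O),
-- so under optimization every call site becomes True.
{-# RULES "isOptimised" forall x. isOptimised x = True #-}

main :: IO ()
main = putStrLn (if isOptimised () then "optimised" else "unoptimised")
```

Compiled with ghc -O2 this should print "optimised"; interpreted or compiled without optimization it prints "unoptimised". As the thread concludes, the probe confirming optimized code was loaded while run stayed slow suggests the overhead lies elsewhere than in GHC's optimizations.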