Why does -O2 have such a big impact on a simple L1 distance calculator in Haskell?
I have implemented a simple L1 distance calculator using Haskell. Since I am interested in performance I used unboxed vectors to store the images to compare.

calculateL1Distance :: LabeledImage -> LabeledImage -> Int
calculateL1Distance reference test =
            let
              subtractPixels :: Int -> Int -> Int
              subtractPixels a b = abs $ a - b
              diff f = Vec.sum $ Vec.zipWith subtractPixels (f reference) (f test)
            in
              diff pixels
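For reference, my mental model of what fusion should produce is a single strict loop, roughly like the following hand-written sketch (my own illustration, not actual GHC output; `l1Loop` is a hypothetical name):

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as Vec

-- Hand-written sketch of the tight loop that stream fusion should, in
-- principle, reduce the zipWith/sum pipeline to: one strict index loop,
-- with no intermediate vector allocated for the zipWith result.
-- 'l1Loop' is an illustrative name, not part of the program below.
l1Loop :: Vec.Vector Int -> Vec.Vector Int -> Int
l1Loop xs ys = go 0 0
  where
    n = min (Vec.length xs) (Vec.length ys)  -- zipWith truncates to the shorter input
    go !i !acc
      | i >= n    = acc
      | otherwise = go (i + 1) (acc + abs (xs Vec.! i - ys Vec.! i))
```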

From what I know (I am very new to Haskell), stream fusion should compile this code down to a simple loop, so it should be fast. However, performance turned out to be low when compiled with

ghc -O -fforce-recomp -rtsopts -o test .\performance.hs

The program took around 60 seconds:

 198,871,911,896 bytes allocated in the heap
   1,804,017,536 bytes copied during GC
     254,900,000 bytes maximum residency (14 sample(s))
       9,020,888 bytes maximum slop
             579 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     378010 colls,     0 par    2.312s   2.949s     0.0000s    0.0063s
  Gen  1        14 colls,     0 par    0.562s   0.755s     0.0539s    0.2118s

  INIT    time    0.000s  (  0.005s elapsed)
  MUT     time   58.297s  ( 64.380s elapsed)
  GC      time    2.875s  (  3.704s elapsed)
  EXIT    time    0.016s  (  0.088s elapsed)
  Total   time   61.188s  ( 68.176s elapsed)

  %GC     time       4.7%  (5.4% elapsed)

  Alloc rate    3,411,364,878 bytes per MUT second

  Productivity  95.3% of total user, 94.6% of total elapsed

However, performance increased drastically when compiling with

ghc -O2 -fforce-recomp -rtsopts -o test .\performance.hs

The runtime dropped to around 13 seconds:

   2,261,672,056 bytes allocated in the heap
   1,571,668,904 bytes copied during GC
     241,064,192 bytes maximum residency (12 sample(s))
       8,839,048 bytes maximum slop
             544 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2951 colls,     0 par    1.828s   1.927s     0.0007s    0.0059s
  Gen  1        12 colls,     0 par    0.516s   0.688s     0.0573s    0.2019s

  INIT    time    0.000s  (  0.005s elapsed)
  MUT     time   10.484s  ( 16.598s elapsed)
  GC      time    2.344s  (  2.615s elapsed)
  EXIT    time    0.000s  (  0.105s elapsed)
  Total   time   12.828s  ( 19.324s elapsed)

  %GC     time      18.3%  (13.5% elapsed)

  Alloc rate    215,718,348 bytes per MUT second

  Productivity  81.7% of total user, 86.4% of total elapsed

The effect is even stronger when I use larger parts of the image sets, because image loading then accounts for a smaller share of the runtime. According to the HaskellWiki, there is effectively almost no difference between -O and -O2 (https://wiki.haskell.org/Performance/GHC). However, I observe a huge effect, so I am wondering if I am missing something. Is there an optimization I should apply to my code by hand that GHC performs when compiling with -O2? If so, what does it do? From what I have read, the main performance improvement comes from stream fusion, and to me the function looks like one that stream fusion can be applied to.

For reference, here is the complete example for my test program.

import Data.List
import Data.Word
import qualified Data.ByteString as ByteStr
import qualified Data.ByteString.Char8 as ByteStrCh8
import qualified Data.Vector.Unboxed as Vec

data LabeledImage = LabeledImage {
       labelIdx :: Int
     , pixels :: Vec.Vector Int
} deriving (Eq)

extractLabeledImages :: ByteStr.ByteString -> [LabeledImage] -> [LabeledImage]
extractLabeledImages source images
      | ByteStr.length source >= imgLength =
                    let
                      (label,trailData) = ByteStr.splitAt labelBytes source
                      (rgbData,remainingData) = ByteStr.splitAt colorBytes trailData
                      numLabel = fromIntegral (ByteStr.head label)
                      pixelValues = Vec.generate (ByteStr.length rgbData) (fromIntegral . ByteStr.index rgbData)
                    in
                      extractLabeledImages remainingData (images ++ [LabeledImage numLabel pixelValues])
      | otherwise = images
      where
        labelBytes = 1
        colorBytes = 3072
        imgLength = labelBytes + colorBytes

calculateL1Distance :: LabeledImage -> LabeledImage -> Int
calculateL1Distance reference test =
            let
              subtractPixels :: Int -> Int -> Int
              subtractPixels a b = abs $ a - b
              diff f = Vec.sum $ Vec.zipWith subtractPixels (f reference) (f test)
            in
              diff pixels

main = do
       batch1Raw <- ByteStr.readFile "M:\\Documents\\StanfordCNN\\cifar10\\data_batch_1.bin"
       testBatchRaw <- ByteStr.readFile "M:\\Documents\\StanfordCNN\\cifar10\\test_batch.bin"

       let referenceImages = take 1000 $ extractLabeledImages batch1Raw []
       let testImages = take 1000 $ extractLabeledImages testBatchRaw []

       putStrLn "Created image sets. Starting tests."
       let results = [calculateL1Distance referenceImage testImage | referenceImage <- referenceImages, testImage <- testImages ]
       ByteStr.writeFile "M:\\Documents\\StanfordCNN\\results.txt" (ByteStrCh8.pack $ show results)
Reporter answered 9/5, 2017 at 9:55 Comment(6)
Does the performance stay fast if you run the -O version after running the -O2 version? It's possible you're seeing the difference between loading the files from disk and retrieving them from the operating system's page cache. - Beachhead
I doubt that's the difference, given the gap in allocation rate and bytes allocated. The -O2 version is probably allocating fewer intermediate objects. If you want to see the difference, the best thing to do is view the compiled Core (described in the linked article). - Beachhead
@Beachhead The difference is due to -fspec-constr. Without actual data to reproduce this it's hard to generate profiles, but in the end the inlined vector code needs -fspec-constr to get the last bit of performance. - Lamothe
Also, OP, depending on your use case, you may want to {-# INLINE #-} your functions. Note that this already happens to calculateL1Distance with -O; it's just a remark for when you start to split those functions into other modules. - Lamothe
@Cirdec: I tried looking at the Core. However, it is quite tough to get into. I could not even find calculateL1Distance in it. I guess it is inlined. - Reporter
@Zeta: I read up on -fspec-constr (downloads.haskell.org/~ghc/latest/docs/html/users_guide/…). This kind of optimization definitely makes sense. Thank you so much, an interesting read! As for the data: it can be found at cs.toronto.edu/~kriz/cifar-10-binary.tar.gz - Reporter
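To illustrate the -fspec-constr pass mentioned in the comments above: SpecConstr (enabled at -O2 but not at plain -O) specialises recursive functions to the constructor shapes of their arguments. The GHC user's guide demonstrates it with a last-style list function; here is a minimal, runnable sketch of the kind of code it targets (`safeLast` is an illustrative name, not part of the program above):

```haskell
-- After the first pattern match, the tail passed to the recursive call
-- is already known to be a cons cell, so with -fspec-constr GHC can
-- generate a specialised loop that avoids re-checking for [] on every
-- iteration. The same idea benefits the specialised inner loops that
-- the vector library's fused code produces.
safeLast :: [a] -> Maybe a
safeLast []       = Nothing
safeLast [x]      = Just x
safeLast (_ : xs) = safeLast xs
```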
