Optimizing Haskell code
Asked Answered
S

6

16

I'm trying to learn Haskell and after an article in reddit about Markov text chains, I decided to implement Markov text generation first in Python and now in Haskell. However I noticed that my python implementation is way faster than the Haskell version, even Haskell is compiled to native code. I am wondering what I should do to make the Haskell code run faster and for now I believe it's so much slower because of using Data.Map instead of hashmaps, but I'm not sure

I'll post the Python code and Haskell as well. With the same data, Python takes around 3 seconds and Haskell is closer to 16 seconds.

It comes without saying that I'll take any constructive criticism :).

import random
import re
import cPickle
class Markov:
    def __init__(self, filenames):
        self.filenames = filenames
        self.cache = self.train(self.readfiles())
        picklefd = open("dump", "w")
        cPickle.dump(self.cache, picklefd)
        picklefd.close()

    def train(self, text):
        splitted = re.findall(r"(\w+|[.!?',])", text)
        print "Total of %d splitted words" % (len(splitted))
        cache = {}
        for i in xrange(len(splitted)-2):
            pair = (splitted[i], splitted[i+1])
            followup = splitted[i+2]
            if pair in cache:
                if followup not in cache[pair]:
                    cache[pair][followup] = 1
                else:
                    cache[pair][followup] += 1
            else:
                cache[pair] = {followup: 1}
        return cache

    def readfiles(self):
        data = ""
        for filename in self.filenames:
            fd = open(filename)
            data += fd.read()
            fd.close()
        return data

    def concat(self, words):
        sentence = ""
        for word in words:
            if word in "'\",?!:;.":
                sentence = sentence[0:-1] + word + " "
            else:
                sentence += word + " "
        return sentence

    def pickword(self, words):
        temp = [(k, words[k]) for k in words]
        results = []
        for (word, n) in temp:
            results.append(word)
            if n > 1:
                for i in xrange(n-1):
                    results.append(word)
        return random.choice(results)

    def gentext(self, words):
        allwords = [k for k in self.cache]
        (first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache]))
        sentence = [first, second]
        while len(sentence) < words or sentence[-1] is not ".":
            current = (sentence[-2], sentence[-1])
            if current in self.cache:
                followup = self.pickword(self.cache[current])
                sentence.append(followup)
            else:
                print "Wasn't able to. Breaking"
                break
        print self.concat(sentence)

Markov(["76.txt"])

--

module Markov
( train
, fox
) where

import Debug.Trace
import qualified Data.Map as M
import qualified System.Random as R
import qualified Data.ByteString.Char8 as B


type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train :: [B.ByteString] -> Database
train (x:y:[]) = M.empty
train (x:y:z:xs) = 
     let l = train (y:z:xs)
     in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) `seq` l

main = do
  contents <- B.readFile "76.txt"
  print $ train $ B.words contents

fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
Shawnna answered 26/5, 2010 at 17:1 Comment(5)
Interesting, also looking for the answer. 16 versus 3 seconds is really a big difference.Housemother
The indentation seems to have gotten mangled for the Python code, by the way...Bother
I don't think your Haskell code accomplishes what you want it to. If you check the output, you'll see that there are no values larger than 2 in the M.Map String Int maps. Do you mean n + o or o + 1 instead of n + 1?Ectosarc
@Travis you are absolutely right but it should be fixed in the edited versionShawnna
Your use of seq in the line beginning "in M.insertWith'" is suspicious. You are constructing a large expression and evaluating it, then throwing away the result and returning something else. Did you mean to switch the arguments, i.e. l seq M.insertWith ...Rockweed
G
7

I tried to avoid doing anything fancy or subtle. These are just two approaches to doing the grouping; the first emphasizes pattern matching, the second doesn't.

import Data.List (foldl')
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B

type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train2 :: [B.ByteString] -> Database2
train2 words = go words M.empty
    where go (x:y:[]) m = m
          go (x:y:z:xs) m = let addWord Nothing   = Just $ M.singleton z 1
                                addWord (Just m') = Just $ M.alter inc z m'
                                inc Nothing    = Just 1
                                inc (Just cnt) = Just $ cnt + 1
                            in go (y:z:xs) $ M.alter addWord (x,y) m

train3 :: [B.ByteString] -> Database2
train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.alter (addWord z) (x,y) m
          addWord word = Just . maybe (M.singleton word 1) (M.alter inc word)
          inc = Just . maybe 1 (+1)

main = do contents <- B.readFile "76.txt"
          let db = train3 $ B.words contents
          print $ "Built a DB of " ++ show (M.size db) ++ " words"

I think they are both faster than the original version, but admittedly I only tried them against the first reasonable corpus I found.

EDIT As per Travis Brown's very valid point,

train4 :: [B.ByteString] -> Database2
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m
          inc k _ = M.insertWith (+) k 1
Guthrun answered 26/5, 2010 at 18:45 Comment(2)
As I matter of style, I think it's better to use something more specific than alter here. We know we'll never need deletion in this situation, and having to add Just like this impairs readability.Ectosarc
Sorry for late response. Could you also explain why it's a faster solution? Basically both do the same, except for the zipping and dropping.Shawnna
B
11

a) How are you compiling it? (ghc -O2 ?)

b) Which version of GHC?

c) Data.Map is pretty efficient, but you can be tricked into lazy updates -- use insertWith' , not insertWithKey.

d) Don't convert bytestrings to String. Keep them as bytestrings, and store those in the Map

Beardless answered 26/5, 2010 at 17:26 Comment(1)
The version is 6.12.1. With your help I was able to squeeze 1 second out of the runtime but it still far from the python version.Shawnna
C
9

Data.Map is designed under the assumption that the class Ord comparisons take constant time. For string keys this may not be the case—and when the strings are equal it is never the case. You may or may not be hitting this problem depending on how large your corpus is and how many words have common prefixes.

I'd be tempted to try a data structure that is designed to operate with sequence keys, such as for example a the bytestring-trie package kindly suggested by Don Stewart.

Chiliasm answered 26/5, 2010 at 22:30 Comment(1)
@don: thanks for the update. I'm convinced you know at least 60% of the contents of hackage by name :-)Chiliasm
G
7

I tried to avoid doing anything fancy or subtle. These are just two approaches to doing the grouping; the first emphasizes pattern matching, the second doesn't.

import Data.List (foldl')
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B

type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train2 :: [B.ByteString] -> Database2
train2 words = go words M.empty
    where go (x:y:[]) m = m
          go (x:y:z:xs) m = let addWord Nothing   = Just $ M.singleton z 1
                                addWord (Just m') = Just $ M.alter inc z m'
                                inc Nothing    = Just 1
                                inc (Just cnt) = Just $ cnt + 1
                            in go (y:z:xs) $ M.alter addWord (x,y) m

train3 :: [B.ByteString] -> Database2
train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.alter (addWord z) (x,y) m
          addWord word = Just . maybe (M.singleton word 1) (M.alter inc word)
          inc = Just . maybe 1 (+1)

main = do contents <- B.readFile "76.txt"
          let db = train3 $ B.words contents
          print $ "Built a DB of " ++ show (M.size db) ++ " words"

I think they are both faster than the original version, but admittedly I only tried them against the first reasonable corpus I found.

EDIT As per Travis Brown's very valid point,

train4 :: [B.ByteString] -> Database2
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m
          inc k _ = M.insertWith (+) k 1
Guthrun answered 26/5, 2010 at 18:45 Comment(2)
As I matter of style, I think it's better to use something more specific than alter here. We know we'll never need deletion in this situation, and having to add Just like this impairs readability.Ectosarc
Sorry for late response. Could you also explain why it's a faster solution? Basically both do the same, except for the zipping and dropping.Shawnna
E
3

Here's a foldl'-based version that seems to be about twice as fast as your train:

train' :: [B.ByteString] -> Database
train' xs = foldl' (flip f) M.empty $ zip3 xs (tail xs) (tail $ tail xs)
  where
    f (a, b, c) = M.insertWith (M.unionWith (+)) (a, b) (M.singleton c 1)

I tried it on the Project Gutenberg Huckleberry Finn (which I assume is your 76.txt), and it produces the same output as your function. My timing comparison was very unscientific, but this approach is probably worth a look.

Ectosarc answered 26/5, 2010 at 18:55 Comment(0)
S
2

1) I'm not clear on your code. a) You define "fox" but don't use it. Were you meaning for us to try to help you using "fox" instead of reading the file? b) You declare this as "module Markov" then have a 'main' in the module. c) System.Random isn't needed. It does help us help you if you clean code a bit before posting.

2) Use ByteStrings and some strict operations as Don said.

3) Compile with -O2 and use -fforce-recomp to be sure you actually recompiled the code.

4) Try this slight transformation, it works very fast (0.005 seconds). Obviously the input is absurdly small, so you'd need to provide your file or just test it yourself.

{-# LANGUAGE OverloadedStrings, BangPatterns #-}
module Main where

import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B


type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train :: [B.ByteString] -> Database
train xs = go xs M.empty
  where
  go :: [B.ByteString] -> Database -> Database
  go (x:y:[]) !m = m
  go (x:y:z:xs) !m =
     let m' =  M.insertWithKey' (\key new old -> M.insertWithKey' (\_ n o -> n + 1) z 1 old) (x, y) (M.singleton z 1) m
     in go (y:z:xs) m'

main = print $ train $ B.words fox

fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
Scandalize answered 26/5, 2010 at 17:1 Comment(1)
Well yes, I'm a beginner like the tag says :P. I didn't realize the consequences of naming the module something other than Main. And the fox was used for me to test the algorithm. It's easier to check small input than input of an entire bookShawnna
S
1

As Don suggested, look into using the stricer versions o your functions: insertWithKey' (and M.insertWith' since you ignore the key param the second time anyway).

It looks like your code probably builds up a lot of thunks until it gets to the end of your [String].

Check out: http://book.realworldhaskell.org/read/profiling-and-optimization.html

...especially try graphing the heap (about halfway through the chapter). Interested to see what you figure out.

Sextan answered 26/5, 2010 at 17:40 Comment(1)
I made the changes Don Stewart suggested. Previously the code took 41-44 megabytes of memory, now it only takes 29. Graphing the memory shows that TSO takes most of the memory, then comes GHC.types, and then the other datatypes used in code. Memory is increased rapidly on all sections for one second. After that one second TSO and GHC.types keep increasing, all others begin slowly receding. (If I'm reading the graph right)Shawnna

© 2022 - 2024 — McMap. All rights reserved.