Processing a very large text file with lazy Texts and ByteStrings

I'm trying to process a very large Unicode text file (6GB+). What I want is to count the frequency of each unique word. I use a strict Data.Map to keep track of the counts of each word as I traverse the file. The process takes too much time and too much memory (20GB+). I suspect the Map is huge, but I'm not sure it should reach 5x the size of the file! The code is shown below. Please note that I tried the following:

  • Using Data.HashMap.Strict instead of Data.Map.Strict. Data.Map seems to perform better here: its memory consumption grows more slowly.

  • Reading the file using lazy ByteString instead of lazy Text, then decoding it to Text, doing the processing, and encoding the result back to ByteString for IO.

    import Data.Text.Lazy (Text(..), cons, pack, append)
    import qualified Data.Text.Lazy as T
    import qualified Data.Text.Lazy.IO as TI
    import Data.Map.Strict hiding (foldr, map, foldl')
    import System.Environment
    import System.IO
    import Data.Word
    
    -- Build the frequency map: pair every word with a count of 1 and combine
    -- duplicate keys with (+). Note the counts are Word16, so they wrap past 65535.
    dictionate :: [Text] -> Map Text Word16
    dictionate = fromListWith (+) . (`zip` [1,1..])
    
    main = do
        [file,out] <- getArgs
        h <- openFile file ReadMode
        hO <- openFile out WriteMode
        mapM_ (flip hSetEncoding utf8) [h,hO]
        txt <- TI.hGetContents h
        TI.hPutStr hO . T.unlines . 
          map (uncurry ((. cons '\t' . pack . show) . append)) . 
          toList . dictionate . T.words $ txt
        hFlush hO
        mapM_ hClose [h,hO]
        print "success"
    

What's wrong with my approach? What's the best way to accomplish what I'm trying to do in terms of time and memory performance?

Biforate asked 4/11, 2013 at 23:54 Comment(23)
How many distinct words are there, roughly, in the file? That should give quite a hint as to whether such high memory consumption is inevitable.Hypoplasia
Are you reading the whole file into memory in order to process it? If so, that explains the high memory usage. Try reading in the file line by line.Miso
@acfrancis: Data.Text.Lazy.IO.hGetContents should certainly get that point right.Hypoplasia
@Hypoplasia let's assume the worst-case, all words in the file are unique. Should the Map size reach 30GB?Biforate
That's surely possible. A tree-like structure such as Map has lots of overhead, I'd suspect (assuming x86-64) ~4 machine words per cell, an extra two for each Text, and finally 1-2 to store the words' characters themselves. Add some total overhead for garbage collection, and 30GB seems not unlikely for hundreds of millions of words.Hypoplasia
@Hypoplasia is there a way to keep parts of this data-structure on disk while still maintaining fast access for the parts currently in-use?Biforate
I suppose it is, but probably quite difficult for a Map. If you don't expect many duplicates among the words, perhaps you should switch to something else entirely. External merge sort (with duplicate-counting & nubbing as a separate step) is relatively simple. Of course anything external is bound to involve lots of dirty IO; I wager it's actually easier to code this up in C or C++, which also give you much better control over the data structures' overhead.Hypoplasia
Anecdotal supporting evidence: I just ran your program on a 516MB text file, generated by concatenating 512 copies of the text of a book. Memory consumption remained constant at ~16MB; the final output listed 25180 words.Algolagnia
@Algolagnia That's really weird. What version of GHC are you using and on which architecture? Also are you processing unicode text? I am processing a mix of arabic and english text.Biforate
@Hypoplasia is there anything of this sort already written for Haskell somewhere?Biforate
It's not weird at all, @Algolagnia just used a text file with far fewer unique words (inevitable for a natural language, and not helped by the use of many copies of the same text) than you apparently have in your file.Hypoplasia
7.6.3, x86-64. Without knowing your input, I guess my results support what leftaroundabout said (my file was made from natural language text, so there are relatively few distinct words).Algolagnia
I wonder why Data.HashMap is so much slower than Data.Map, even though the unordered-containers page says: "The containers have been optimized for performance critical use, both in terms of large data quantities and high speed".Biforate
Why not try replacing Map then? It ought to be trivial to replace Map with HashMap and easy to replace it with Hashtable.Eaglet
@J.Abrahamson I did. That's how I got to know it's much slower :)Biforate
@Biforate Ah, you commented as I wrote mine I think.Eaglet
Just to ask the stupid questions: you're compiling with -O2, right?Melissa
@DanielWagner: yes, tried both -O and -O2 :)Biforate
You might profit from using something like bytestring-trieEaglet
Try finding efficient word count in some other languages and compare. That may give you insight into the problem (i.e. Dictionary has high overhead? Making copies of strings?)Pistil
seconding @J.Abrahamson's suggestion; a trie is at least in theory a much better data structure for this.Saintsimonianism
...oh, and of course the obligatory "what did you learn from profiling?" commentSaintsimonianism
Have you considered using a Trie for this? It's a very good tool for this kind of analysis: en.wikipedia.org/wiki/Trie hackage.haskell.org/package/TrieMap Tries have particular advantages over binary trees or hashmaps for your kind of problem.Propman

This memory usage is expected. A Data.Map.Map with N entries consumes about 6N words of memory plus the size of the keys and values (data taken from this excellent post by Johan Tibell). A lazy Text value takes up 7 words plus 2 bytes per character (rounded up to a multiple of the machine word size), and a Word16 takes up two words (header + payload). We will assume a 64-bit machine, so the word size is 8 bytes. We will also assume that the average string in the input is 8 characters long.

Taking this all into account, the final formula for the memory usage is 6N + 7N + 2N + 2N = 17N words: the Map overhead, the Text headers, the Text payloads (8 characters = 2 words each), and the Word16 values, respectively.

In the worst case, all words will be different and there will be about (6 * 1024^3)/8 ~= 800 * 10^6 of them (one word per 8 bytes of input). Plugging that into the formula above gives a worst-case map size of approx. 102 GiB, which seems to agree with the experimental results. Solving the equation in the reverse direction tells us that your file contains about 200*10^6 different words.
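
For concreteness, here is the same arithmetic as a quick GHCi sanity check, using the assumptions above (8-byte machine words, 8-character average words; bytesPerEntry is just a name for the 17-word-per-entry figure):

    ghci> let bytesPerEntry = (6 + 7 + 2 + 2) * 8   -- 17 machine words of 8 bytes each
    ghci> ((6 * 1024^3) `div` 8) * bytesPerEntry `div` 1024^3   -- worst case, in GiB
    102
    ghci> 200 * 10^6 * bytesPerEntry `div` 1024^3   -- ~200 million distinct words, in GiB
    25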

As for alternative approaches to this problem, consider using a trie (as suggested by J.Abrahamson in the comments) or an approximate method, such as a count-min sketch; a rough sketch of the latter is shown below.
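
Just to make the second suggestion concrete, here is a minimal count-min sketch counter. This is only a sketch of the idea, not code from the answer: the table dimensions, the use of hashWithSalt (from the hashable package) as the per-row hash, and the command-line interface are all illustrative choices. Memory stays fixed at width * depth counters no matter how many distinct words the file contains; the price is that the reported counts may overestimate (but never underestimate) the true frequencies.

    import qualified Data.Text                   as T
    import qualified Data.Text.Lazy              as TL
    import qualified Data.Text.Lazy.IO           as TLIO
    import qualified Data.Vector.Unboxed.Mutable as UM
    import           Data.Hashable (hashWithSalt)
    import           Control.Monad (forM_)
    import           Data.Word (Word64)
    import           System.Environment (getArgs)
    
    width, depth :: Int
    width = 1024 * 1024          -- counters per row (illustrative value)
    depth = 4                    -- number of independent hash rows
    
    -- One bucket index per row; `mod` keeps the index non-negative.
    buckets :: T.Text -> [(Int, Int)]
    buckets w = [ (row, hashWithSalt row w `mod` width) | row <- [0 .. depth - 1] ]
    
    main :: IO ()
    main = do
        (file : queries) <- getArgs
        table <- UM.replicate (width * depth) (0 :: Word64)
        -- Stream the file word by word; only the fixed-size table is retained.
        ws <- TL.words <$> TLIO.readFile file
        forM_ ws $ \w ->
            forM_ (buckets (TL.toStrict w)) $ \(row, col) ->
                UM.modify table (+ 1) (row * width + col)
        -- The estimated count of a word is the minimum over its buckets.
        forM_ queries $ \q -> do
            counts <- mapM (\(row, col) -> UM.read table (row * width + col))
                           (buckets (T.pack q))
            putStrLn (q ++ "\t" ++ show (minimum counts))

A trie-based version would instead keep exact counts with the same streaming shape: a strict left fold over the words, inserting with (+) into the trie in place of the Map.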

Showery answered 5/11, 2013 at 8:34 Comment(0)

In the world of traditional data processing, this problem would have been solved by sorting (externally, on disk or magnetic tape if needed) and then scanning the sorted file to count the grouped-together runs of words. Of course, you could do some partial reductions during the early phases of sorting to save some space and time.
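
For what it's worth, here is a minimal sketch of the second phase in Haskell: counting the grouped-together runs in input that has already been sorted, one word per line. The external sort producing that input is assumed to have been done separately (for example with the Unix sort utility or a hand-rolled external merge sort).

    import qualified Data.Text.Lazy    as TL
    import qualified Data.Text.Lazy.IO as TLIO
    import           Data.List (group)
    
    main :: IO ()
    main = do
        sortedWords <- TL.lines <$> TLIO.getContents   -- sorted input, one word per line
        -- Equal words are adjacent after sorting, so the length of each run is
        -- that word's total count; only one run needs to be in memory at a time.
        mapM_ (\run -> TLIO.putStrLn (head run `TL.append` TL.pack ('\t' : show (length run))))
              (group sortedWords)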

Nimitz answered 13/11, 2013 at 6:7 Comment(0)
