Haskell parse big xml file with low memory

I've played around with several Haskell XML libraries, including hexpat and xml-enumerator. After reading the IO chapter in Real World Haskell (http://book.realworldhaskell.org/read/io.html), I was under the impression that if I ran the following code, the input would be garbage collected as I went through it.

However, when I run it on a big file, memory usage keeps climbing as it runs.

runghc parse.hs bigfile.xml

What am I doing wrong? Is my assumption wrong? Does the map/filter force it to evaluate everything?

import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.UTF8 as U
import Prelude hiding (readFile)
import Text.XML.Expat.SAX 
import System.Environment (getArgs)

main :: IO ()
main = do
    args <- getArgs
    contents <- BSL.readFile (head args)
    -- putStrLn $ U.toString contents
    let events = parse defaultParseOptions contents 
    mapM_ print $ map getTMSId $ filter isEvent events

isEvent :: SAXEvent String String -> Bool
isEvent (StartElement "event" _) = True
isEvent _ = False

getTMSId :: SAXEvent String String -> Maybe String
getTMSId (StartElement _ as) = lookup "TMSId" as
getTMSId _ = Nothing  -- total: any other event has no TMSId

My end goal is to parse a huge XML file with a simple SAX-like interface. I don't want to have to be aware of the whole structure just to get notified that I've found an "event".

Grudging answered 9/11, 2011 at 15:19 Comment(3)
Do you also get this behavior when compiling it rather than running it in interpreted mode? - Runge
And don't forget to use optimization (-O2) when compiling. - Durarte
Do you have to compile and optimize to get it to garbage collect? If so, I'll be sure to try that in the future. - Grudging

I'm the maintainer of hexpat. This is a bug, which I have now fixed in hexpat-0.19.8. Thanks for drawing it to my attention.

The bug is new on ghc-7.2.1. It comes from an interaction I didn't expect between a where clause that binds a triple and unsafePerformIO, which I need in order to make the interaction with the C code appear pure in Haskell.

Laze answered 10/11, 2011 at 12:42 Comment(0)

This appears to be an issue with hexpat. Even when the program is compiled with optimization, a task as simple as taking the length of the event list results in linear memory use.
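For reference, a minimal sketch of that kind of length test. It uses only the hexpat names that already appear in the question (parse, defaultParseOptions, SAXEvent); the countEvents helper and the usage message are my own illustrative additions, not part of hexpat.

```haskell
import qualified Data.ByteString.Lazy as BSL
import System.Environment (getArgs)
import Text.XML.Expat.SAX (SAXEvent, defaultParseOptions, parse)

-- Hypothetical helper: count the SAX events in a lazily read document.
-- length never looks at an event twice, so with a streaming parser the
-- already-counted part of the list should be collectable.
countEvents :: BSL.ByteString -> Int
countEvents contents = length events
  where
    -- The annotation fixes the tag and text types, which parse leaves open.
    events :: [SAXEvent String String]
    events = parse defaultParseOptions contents

main :: IO ()
main = do
    args <- getArgs
    case args of
        [path] -> BSL.readFile path >>= print . countEvents
        _      -> putStrLn "usage: count FILE.xml"
```

Assuming a standard GHC setup, building with ghc -O2 -rtsopts and running with +RTS -s prints the maximum residency, which is the number to compare across input sizes.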

Looking at hexpat, I think there is excessive caching going on (see the parseG function). I suggest contacting the hexpat maintainer(s) and asking if this is expected behavior. It should have been mentioned in the haddocks either way, but resource consumption seems to get ignored too often in library documentation.

Durarte answered 9/11, 2011 at 18:47 Comment(2)
From a quick heap profile, it looks like most of it comes from leaking (:) constructors. - Runge
Nice to know my assumption wasn't wrong. I guess I'll keep messing around with other packages. Thanks! - Grudging
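
The assumption itself (that filtering and mapping over a lazy list does not force the whole list) can be checked without any XML library. The sketch below is only an illustration: Event is a hypothetical stand-in for hexpat's SAXEvent, and endlessEvents is a made-up infinite stream. Unlike the code in the question, it drops events without a TMSId instead of printing Nothing.

```haskell
-- Hypothetical stand-in for hexpat's SAXEvent, so the sketch needs no XML library.
data Event
    = StartElement String [(String, String)]
    | EndElement String
    | CharacterData String

-- Keep only "event" start tags and pull out their TMSId attribute.
-- Events that fail either pattern are skipped.
tmsIds :: [Event] -> [String]
tmsIds events =
    [ tmsId
    | StartElement "event" attrs <- events
    , Just tmsId <- [lookup "TMSId" attrs]
    ]

-- A made-up infinite event stream: event 1, event 2, and so on.
endlessEvents :: [Event]
endlessEvents = concatMap eventFor [1 :: Int ..]
  where
    eventFor n =
        [ StartElement "event" [("TMSId", "EP" ++ show n)]
        , CharacterData "payload"
        , EndElement "event"
        ]

main :: IO ()
main = mapM_ putStrLn (take 3 (tmsIds endlessEvents))
```

This terminates even though the input list is infinite, which would not be possible if the filtering step forced the entire list first. So the pipeline in the question is fine as far as laziness goes; whether memory stays flat then depends on the producer of the list.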
