The following question is based upon the accepted answer of this question. The author of the accepted answer said that the streaming helper API in xml-conduit
was not updated for years (source: accepted answer of SO question), and he recommends the Cursor
interface.
Based on the solution of the first question, I wrote the following haskell code which uses the Cursor
interface of xml-conduit
package.
import Text.XML as XML (readFile, def)
import Text.XML.Cursor (Cursor, ($/), (&/), ($//), (>=>),
fromDocument, element, content)
import Data.Monoid (mconcat)
import Filesystem.Path (FilePath)
import Filesystem.Path.CurrentOS (fromText)
data Page = Page
{ title :: Text
} deriving (Show)
parse :: FilePath -> IO ()
parse path = do
doc <- XML.readFile def path
let cursor = fromDocument doc
let pages = cursor $// element "page" >=> parseTitle
writeFile "output.txt" ""
mapM_ ((appendFile "output.txt") . (\x -> x ++ "\n") . show) pages
parseTitle :: Cursor -> [Page]
parseTitle c = do
let titleText = c $/ element "title" &/ content
[Page (mconcat titleText)]
main :: IO ()
main = parse (fromText "input.xml")
This code works on small XML files. However, when the code is run on a 30G XML file, the execution is killed by the OS.
How can I make this code work on a very large XML file?
xml-conduit
? Or is there another XML streaming library you suggest to use for this task? – Colonnade