Running Haskell HXT outside of IO?

All the examples I've seen so far of the Haskell XML Toolbox, HXT, use runX to execute the parser. runX runs inside the IO monad. Is there a way of using this XML parser outside of IO? Parsing a string seems like a pure operation to me, so I don't understand why I'm forced to be inside IO.

Zennie answered 10/10, 2010 at 18:5 Comment(3)
A quick glance makes it look to me like runX reads the XML file and is thus impure IO. - Robenarobenia
I think HXT's parser is an 'online' parser, i.e. it doesn't need to read the whole input to start producing output. The upside of this is that (in principle at least) it can run with a constant memory footprint; the downside is that it has to read "chunks" of the input on demand, so it must be in IO. - Geranium
But couldn't this easily be solved using lazy IO? For example, to get high-performance XML parsing, I currently use Hexpat (for ByteStrings and lazy SAX parsing) and polyparse (Poly.Lazy). I get a constant memory footprint, fast processing, and all the parsing is done in pure functions! - Zitella

You can use HXT's xread along with runLA to parse an XML string outside of IO.

xread has the following type:

xread :: ArrowXml a => a String XmlTree

This means you can compose it with any arrow of type (ArrowXml a) => a XmlTree Whatever to get an a String Whatever.

runLA is like runX, but for things of type LA:

runLA :: LA a b -> a -> [b]

LA is an instance of ArrowXml.
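
As a minimal sketch of how these two pieces fit together (assuming HXT 9.x, where the umbrella module is Text.XML.HXT.Core; rootText is a hypothetical name chosen for illustration), xread can be composed with ordinary filters and run with runLA, giving a function with no IO in its type:

```haskell
import Text.XML.HXT.Core

-- Hypothetical helper: collect the text nodes directly under the
-- top-level element of an XML string. Note the pure result type.
rootText :: String -> [String]
rootText = runLA (xread >>> getChildren >>> getText)

main :: IO ()
main = print (rootText "<a>hello</a>")  -- should print ["hello"]
```

Only main lives in IO here, and only because it prints; rootText itself can be called from any pure code.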

To put this all together, the following version of my answer to your previous question uses HXT to parse a string containing well-formed XML without any IO involved:

{-# LANGUAGE Arrows #-}
module Main where

import qualified Data.Map as M
import Text.XML.HXT.Core  -- HXT 9.x; older releases exposed this as Text.XML.HXT.Arrow

classes :: (ArrowXml a) => a XmlTree (M.Map String String)
classes = listA (divs >>> pairs) >>> arr M.fromList
  where
    divs = getChildren >>> hasName "div"
    pairs = proc div -> do
      cls <- getAttrValue "class" -< div
      val <- deep getText         -< div
      returnA -< (cls, val)

getValues :: (ArrowXml a) => [String] -> a XmlTree (String, Maybe String)
getValues cs = classes >>> arr (zip cs . lookupValues cs) >>> unlistA
  where lookupValues cs m = map (flip M.lookup m) cs

xml = "<div><div class='c1'>a</div><div class='c2'>b</div>\
      \<div class='c3'>123</div><div class='c4'>234</div></div>"

values :: [(String, Maybe String)]
values = runLA (xread >>> getValues ["c1", "c2", "c3", "c4"]) xml

main = print values

classes and getValues are similar to the previous version, with a few minor changes to suit the expected input and output. The main difference is that here we use xread and runLA instead of readString and runX.

It would be nice to be able to read something like a lazy ByteString in a similar manner, but as far as I know this isn't currently possible with HXT.


A couple of other things: you can parse strings in this way without IO, but it's probably better to use runX whenever you can, since it gives you more control over the configuration of the parser, error messages, etc.
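
For comparison, here is a rough sketch of the IO-based route, assuming HXT 9.x (textInIO is a hypothetical name, and the two options shown are just examples of what the configuration list accepts). readString takes a list of parser options that xread has no way to receive:

```haskell
import Text.XML.HXT.Core

-- Hypothetical helper: parse a string with explicit parser options
-- and collect the text nodes found in the document.
textInIO :: String -> IO [String]
textInIO s =
  runX (readString [withValidate False, withWarnings False] s >>> deep getText)

main :: IO ()
main = textInIO "<a>hello</a>" >>= print
```

The result type is IO [String] rather than [String], which is the price paid for the extra configurability.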

Also: I tried to make the code in the example straightforward and easy to extend, but the combinators in Control.Arrow and Control.Arrow.ArrowList make it possible to work with arrows much more concisely if you like. The following is an equivalent definition of classes, for example:

classes = (getChildren >>> hasName "div" >>> pairs) >. M.fromList
  where pairs = getAttrValue "class" &&& deep getText
Drisko answered 10/10, 2010 at 19:11 Comment(1)
Hi Travis. That's great. Your help is very useful in my attempt to come to grips with HXT. - Zennie

Travis Brown's answer was very helpful. I just want to add my own solution here, which I think is a bit more general (using the same functions, just ignoring the problem-specific issues).

I was previously unpickling with:

upIO      :: XmlPickler a => String -> IO [a]
upIO str   = runX $ readString [] str >>> arrL (maybeToList . unpickleDoc xpickle)

which I was able to change to this:

upPure    :: XmlPickler a => String -> [a]
upPure str = runLA (xreadDoc >>> arrL (maybeToList . unpickleDoc xpickle)) str

I completely agree with him that doing this gives you less control over the configuration of the parser, error messages and so on, which is unfortunate.
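
To try upPure end to end, something like the following sketch might work. It assumes HXT 9.x, and the Person type and its pickler are hypothetical, invented only to give upPure something to decode:

```haskell
import Data.Maybe (maybeToList)
import Text.XML.HXT.Core

-- Hypothetical record type, not part of the original question.
data Person = Person { personName :: String, personAge :: Int }
  deriving (Show, Eq)

-- Pickler for <person name="..." age="..."/>.
instance XmlPickler Person where
  xpickle =
    xpElem "person" $
      xpWrap (uncurry Person, \p -> (personName p, personAge p)) $
        xpPair (xpAttr "name" xpText) (xpAttr "age" xpPrim)

-- Same definition as above, written point-free.
upPure :: XmlPickler a => String -> [a]
upPure = runLA (xreadDoc >>> arrL (maybeToList . unpickleDoc xpickle))

main :: IO ()
main = print (upPure "<person name='Ann' age='30'/>" :: [Person])
```

The type annotation at the call site is what selects the pickler; a node that fails to unpickle is silently dropped by maybeToList, so an empty list can mean either "no input" or "input did not match".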

Picrite answered 21/10, 2014 at 9:22 Comment(0)
