How do I make Attoparsec parser succeed without consuming (like parsec lookAhead)
Asked Answered
C

2

5

I wrote a quick attoparsec parser to walk an aspx file and drop all the style attributes, and it's working fine except for one piece of it where I can't figure out how to make it succeed on matching > without consuming it.

Here's what I have:

anyTill = manyTill anyChar
anyBetween start end = start *> anyTill end

styleWithQuotes = anyBetween (stringCI "style=\"") (stringCI "\"")
styleWithoutQuotes = anyBetween (stringCI "style=") (stringCI " " <|> ">")
everythingButStyles = manyTill anyChar (styleWithQuotes <|> styleWithoutQuotes) <|> many1 anyChar

I understand it's partially because of how I'm using manyTill in everythingButStyles, that's how I am actively dropping all the styles stuff on the ground, but in styleWithoutQuotes I need it to match ">" as an end, but not consume it, in parsec I would have just done lookAhead ">" but I can't do that in attoparsec.

Charterhouse answered 2/11, 2012 at 20:10 Comment(1)
Which stream type (ByteString or Text) you're using is unclear, but, in case you're using Text, you should use asciiCI rather than stringCI; the latter is now deprecated.Elbrus
H
5

Meanwhile, the lookAhead combinator was added to attoparsec, so now one can just use lookAhead (char '>') or lookAhead (string ">") to achieve the goal.

Below is a workaround from the times before its introduction.


You can build your non-consuming parser using peekWord8, which just looks at the next byte (if any). Since ByteString has a Monoid instance, Parser ByteString is a MonadPlus, and you can use

lookGreater = do
    mbw <- peekWord8
    case mbw of
      Just 62 -> return ">"
      _ -> mzero

(62 is the code point of '>') to either find a '>' without consuming it or fail.

Heerlen answered 2/11, 2012 at 23:8 Comment(2)
Works perfectly! Thanks! I was thinking I could use peek somehow for this but couldn't think up how to make it return the correct string type to fit (I'll have to go read about mzero now, I still am not particularly familiar with all the type classes and their instances.. I'm afraid the typeclassopedia will swallow my soul if I force myself through it)Charterhouse
You'll get your soul back at the end. Slightly transformed.Heerlen
S
5
anyBetween start end = start *> anyTill end

Your anyBetween parser eats its last character because anyTill does - it's designed to parse upto an end marker, but assuming you didn't want to keep the closing brace in the input to parse again.

Notice that your end parsers are all single character parsers, so we can change the functionality to make use of this:

anyBetween'' start ends = start *> many (satisfy (not.flip elem ends))

but many isn't as efficient as Attoparsec's takeWhile, which you should use as much as possible, so if you've done

import qualified Data.Attoparsec.Text as A

then

anyBetween' start ends = start *> A.takeWhile (not.flip elem ends)

should do the trick, and we can rewrite

styleWithoutQuotes = anyBetween' (stringCI "style=") [' ','>']

If you want it to eat the ' ' but not the '>' you can explicitly eat spaces afterwards:

styleWithoutQuotes = anyBetween' (stringCI "style=") [' ','>'] 
                     <* A.takeWhile isSpace

Going for more takeWhile

Perhaps styleWithQuotes could do with a rewrite to use takeWhile as well, so let's make two helpers on the lines of anyBetween. They take from a starting parser up to an ending character, and there's inclusive and exclusive versions:

fromUptoExcl startP endChars = startP *> takeTill (flip elem endChars)
fromUptoIncl startP endChars = startP *> takeTill (flip elem endChars) <* anyChar

But I think from what you said, you want styleWithoutQuotes to be a hybrid; it eats ' ' but not >:

fromUptoEat startP endChars eatChars = 
            startP 
            *> takeTill (flip elem endChars) 
            <* satisfy (flip elem eatChars)

(All of these assume a small number of characters in your end character lists, otherwise elem isn't efficient - there are some Set variants if you're checking against a big list like an alphabet.)

Now for the rewrite:

styleWithQuotes' = fromUptoIncl (stringCI "style=\"") "\""
styleWithoutQuotes' = fromUptoEat (stringCI "style=") " >" " "

The overall parser

everythingButStyles uses <|> in a way that means that if it doesn't find "style" it will backtrack then take everything. This is an example of the sort of thing which can be slow. The problem is that we fail late - at the end of the input string, which is a bad time to make a choice about whether we should fail. Let's go all out and try to

  1. Fail straight away if we're going to fail.
  2. Maximise use of the faster parsers from Data.Attoparsec.Text.Internal

Idea: take until we get an s, then skip the style if there's one there.

notStyleNotEvenS = takeTill (flip elem "sS") 
skipAnyStyle = (styleWithQuotes' <|> styleWithoutQuotes') *> notStyleNotEvenS 
               <|> cons <$> anyChar <*> notStyleNotEvenS

The anyChar is usually an s or S, but there's no sense checking that again.

noStyles = append <$> notStyleNotEvenS <*> many skipAnyStyle 

parseNoStyles = parseOnly noStyles
Smite answered 2/11, 2012 at 23:57 Comment(9)
Yeah I know takeWhile is supposed to be more efficient, but from what I was trying to do it seemed takeWhile only allows you to match on single bytes at a time where most of my matches as you can see are on strings. Am I misunderstanding how to use takeWhile? Thanks for your contribution here as wellCharterhouse
@JimmyHoffa Your starts are all matches on strings (and string is efficient, so that's OK). Your ends (in your example code) are all single characters, or <|>s of them, that's why I allowed a list of end characters. Does the edit clarify?Smite
Ah you're right. Cool, so basically many and string is the right way to go anytime you want to match lots of strings, but take and satisfy are the way to go when you're trying to match on single chars?Charterhouse
@JimmyHoffa Pretty much. satisfy is OK for single use, but whenever you feel like using many with satisfy you should be using takeWhile or takeTill. (New edit too.)Smite
Thanks for all the details! and yes I didn't add the running parser I use which is parseAllOf = parseOnly . many1 so I call parseAllOf everythingButStylesCharterhouse
@JimmyHoffa I've added a section about optimising the main parser too. I assumed from the start that using AttoParsec was about speed, otherwise you might have used parsec and stuck with strings.Smite
I'm working with Attoparsec because from my understanding it can do everything parsec can do, and some things parsec can't, as well as being faster, and Attoparsec isn't any more difficult than parsec from what I've seen. Given that, I figure I don't have reason to bother using parsec for anything anymore.Charterhouse
@JimmyHoffa Your reasons make sense. (It's terribly tempting when writing all those takeWhile functions to write this from scratch using Prelude's break, scan, splitAt etc. to hand-craft a parser-free solution, but there's a reason we're encouraged to use parsers and not to write Read instances by hand: it's messy and hard to read (haha pun - source code and data can both be hard to read hahaha programmer's humour lol rofl so funny). This is clear, and I'm sure it's fast enough for your data.)Smite
Because styleWithQuotes' and styleWithoutQuotes' overlap (they both start by parsing the string "style="), you could also refactor styleWithQuotes' <|> styleWithoutQuotes' into one parser in order to reduce the need for backtracking.Elbrus

© 2022 - 2024 — McMap. All rights reserved.