Why does attoparsec need manyTill if it backtracks?

Consider the usage of these different parser combinators.

import Control.Applicative.Combinators
import Text.Regex.Applicative
 
main :: IO ()
main = do
  let parser1 = sym '"' *> manyTill anySym (sym '"')
  print $ match parser1 "\"abc\""
  let parser2 = sym '"' *> many anySym <* sym '"'
  print $ match parser2 "\"abc\""

import Control.Applicative.Combinators            
import Text.ParserCombinators.ReadP hiding(many, manyTill)
 
main :: IO ()
main = do
  let parser1 = char '"' *> manyTill get (char '"')
  print $ readP_to_S parser1 "\"abc\""
  let parser2 = char '"' *> many get <* char '"'
  print $ readP_to_S parser2 "\"abc\""

{-# LANGUAGE OverloadedStrings #-}
 
import Control.Applicative.Combinators
import Data.Attoparsec.Text hiding(manyTill)
 
main :: IO ()
main = do
  let parser1 = char '"' *> manyTill anyChar (char '"')
  print $ parseOnly parser1 "\"abc\""
  let parser2 = char '"' *> many anyChar <* char '"'
  print $ parseOnly parser2 "\"abc\""

import Control.Applicative.Combinators
import Text.Megaparsec hiding(many, manyTill)
import Data.Void

main :: IO ()
main = do
  let parser1 = single '"' *> manyTill anySingle (single '"') :: Parsec Void String String
  print $ parseMaybe parser1 "\"abc\""
  let parser2 = single '"' *> many anySingle <* single '"' :: Parsec Void String String
  print $ parseMaybe parser2 "\"abc\""

With all four of them, the manyTill parser successfully matches abc, since this doesn't depend on backtracking. With regex-applicative and ReadP, the many parser also successfully matches abc, since they both backtrack by default. With megaparsec, the many parser fails to match, since it doesn't backtrack by default. So far, everything makes sense. However, with attoparsec, the many parser fails to match, even though it does backtrack: its documentation says "attoparsec parsers always backtrack on failure" and "if you feed incremental input to an a parser, it will require memory proportional to the amount of input you supply. (This is necessary to support arbitrary backtracking.)". Why is this? Isn't that supposed to be exactly what backtracking is?

The meaning of "backtracking" in the Attoparsec documentation is different than the meaning of backtracking for the other backtracking parsers.

It helps to review what "backtracking" means when using try for a Parsec or Megaparsec parser. These parsers have a concept of failing after consuming input ("consume err" = cerr) versus failing after consuming nothing ("empty err" = eerr). For these parsers, the p <|> q alternative operator treats failure of p differently if it's cerr (fail the whole p <|> q immediately) versus eerr (try the alternative q instead). The try function backtracks by converting cerr to eerr. That is, try p <|> q will "backtrack" an erroneous consumption of the input stream in the event p fails with cerr. It's a single step of backtracking on failure within alternatives (though with nested try calls, multiple steps of backtracking can be performed in a sequence/cascade of parse failures).

Attoparsec doesn't differentiate betweeen cerr and eerr, so it's as if all parsers are surrounded by try calls. This means that it automatically performs multiple steps of backtracking on failure within alternatives.

ReadP backtracks implicitly by simultaneously evaluating every possible parse in parallel, discarding those that ever fail, and picking the "first" one that remains. It "backtracks" failures over a tree of all possible parses, whether the failures were generated in the context of an alternative or not.

It turns out that "multiple steps of backtracking on failure within alternatives" is a more modest form of backtracking than "backtracking over a tree of all possible parses".

A couple of simplified examples may help show the difference. Consider the parser:

(anyChar *> char 'a') <|> char 'b'

and the input string "bd". This parser fails with Parsec/Megaparsec. The left-hand alternative consumes the "b" with anyChar before failing, having consumed input (cerr), and the whole parser fails. This works fine with Attoparsec, though: the left-hand alternative fails at char 'a', and Attoparsec backtracks on this failure within an alternative to try char 'b' which succeeds. It also works with ReadP which constructs all possible parses in parallel, before discarding the parse from the left-hand alternative when char 'a' fails, resulting in a single successful parse by char 'b'.

Now, consider the parser:

(anyChar <|> pure '*') *> char 'b'

and the input string "b". (Recall that pure '*' consumes nothing and always succeeds.) This parser fails with Parsec/Megaparsec, because anyChar parses the "b", the pure '*' is ignored, and the empty string isn't matched by char 'b'. It also fails with Attoparsec: anyChar successfully parses the "b", and there's no failure in the context of an alternative, so no backtracking to try the pure '*' alternative. The attempt to parse the empty string with char 'b' subsequently fails. (This failure, if it occurs in the context of another alternative might result in backtracking of that alternative, but never reconsideration of this pure '*' alternative.)

In contrast, this parses fine with ReadP. ReadP parses the alternatives in parallel, considering both anyChar parsing the "b" and pure '*' parsing nothing. When the char 'b' parse is tried, it fails on the former but succeeds on the latter.

Back to your example. When parsing with Attoparsec, because:

many p = ((:) <$> p <*> many p) <|> pure []

the left alternative (:) <$> anyChar <*> many anyChar will continue to successfully match all the way up to and including the point where anyChar matches the closing quotation mark. At EOF, the left-hand side will fail (without consuming input, though Attoparsec doesn't care about that), and the right-hand side will succeed. The only failure within an alternative is at EOF, which didn't consume anything anyway, so Attoparsec's automatic "backtracking" plays no role; Megaparsec would have done the same thing. Anyway, once this many anyChar has succeeded, it won't be revisited, even if the terminating char '"' subsequently fails.

So, that's why you need manyTill to explicitly watch for the terminating character.

Recommended topics

Hot tags