Is it possible to efficiently look ahead more than one Char in Attoparsec?

I'm trying to augment Haskell's Attoparsec parser library with a function

takeRegex :: Regex -> Parser ByteString

using one of the regexp implementations.

(Motivation: Good regex libraries can provide performance that is linear to the length of the input, while attoparsec needs to backtrack. A portion of my input is particularly amenable to parsing using regexps, and even the backtracking Text.Regex.PCRE library gets me 4x speedup over attoparsec code for that piece.)

Attoparsec used to have a getInput :: Parser ByteString function to get (witout consuming) remaining input; that would probably have been quite perfect for my purposes, as my input is non-incremental, strict and reasonably small – I run the parser for a line of a log file at a time. With it, it seems I could have done something like

takeRegex re = do
  input <- getInput
  m <- matchM re input
  take $ length m

Unfortunately recent versions of attoparsec lack this function. Is there some way to achieve the same? Why has the function been removed?

Now there is the takeByteString :: Parser ByteString function, which takes and consumes the rest of the input. If there was a function to attempt a parse and get the result without actually consuming anything, this could be used in conjunction with that, but I cannot seem to find (or figure out how to implement) such a function either.

Is there some way to achieve this with the current version of attoparsec?

There are a few solutions to this, but none are great....

Method 1- Fast to implement, but not so fast to run

Well, (according to http://hackage.haskell.org/package/attoparsec-0.10.1.1/docs/Data-Attoparsec-ByteString.html), attoparsec always backtracks on failure, so you can always do something like this-

parseLine1 = do
  line <- takeTill (== '\n')
  char '\n'
  case <some sort of test on line, ie- a regex> of
    Just -> return <some sort of data type>
    Nothing -> fail "Parse Error"

then later many of these chained together will work as expected

parseLine = parseLine1 <|> parseLine2

The problem with this solution is, as you can see, you are still doing a bunch of backtracking, which can really slow things down.

Method 2- The traditional method

The usual way to handle this type of thing is to rewrite the grammar, or in the case of a parser combinator, move stuff around, to make the full algorithm need only one character of lookahead. This can almost always be done in practice, although it sometimes makes the logic much harder to follow....

For example, suppose you have a grammar production rule like this-

pet = "dog" | "dolphin"

This would need two characters of lookahead before either path could be resolved. Instead you can left factor the whole thing like this

pet => "ca" halfpet
halfpet => "g" | "lphin"

No parallel processing is needed, but the grammar is much uglier. (Although I wrote this as a production rule, there is a one to one mapping to a similar parser combinator).

Method 3- The correct way, but involved to write.

The true way that you want to do this is to directly compile a regex to a parser combinator.... Once you compile any regular language, the resulting algorithm always only need one character of lookahead, so the resulting attoparsec code should be pretty simple (like the routine in method 1 for a single character read), but the work will be in compiling the regex.

Compiling a regex is covered in many textbooks, so I won't go into detail here, but it basically amounts to replacing all the ambiguous paths in the regex state machine with new states. Or to put it differently, it automatically "left factors" all the cases that would need backtracking.

(I wrote a library that automatically "left factors" many cases in context free grammars, turning almost any context free grammar into linear parser once, but I haven't yet made it available.... some day, when I have cleaned it up I will).

Recommended topics

Hot tags