Performance of uu-parsinglib compared to "try" in Parsec
Asked Answered
M

2

7

Question

I know Parsec and uu-parsinglib and I've written parsers in both of them. Recently I discovered, that there is a problem in uu-parsinglib, which could significantly affect its performance and I do not see a way to solve it.

Lets consider following Parsec parsers:

pa = char 'a'
pb = char 'b'
pTest = many $ try(pa <* pb)

What would be the equivalent in uu-parsinglib? It would not be the following:

pa = pSym 'a'
pb = pSym 'b'
pTest = pList_ng (pa <* pb)

The difference is, that in Parsec, many would eat (pa <* pb) (pairs of "ab") greedy until it is not matched anymore, while in uu-parsinglib, pList_ng is not greedy, so it will keep in memory possible backtrack ways after parsing each (pa <* pb).

Is there any way to write something like pList(try(pa <* pb)) in uu-parsinglib?

Example

A good example would be

pExample = pTest <* (pa <* pb)

and a sample input of "ababab".

With Parsec, we would get error (because pTest is greedy parsing pairs of "ab"), but in uu-parsinglib, the string would be parsed with no problems.

Edit

We cannot switch from pList_ng to pList, because it would be not equivalent to Parsec version. For example:

pExample2 = pTest <* pa

and a sample input of "ababa" would success in Parsec, but fail in uu-parsinglib, when using greedy pList.

Of course uu-parsinglib will succeed if we use pList_ng here, but it could be a lot slower for some inputs and rules. For example, considering the input "ababab", Parsec would simply fail, because pTest would consume whole string and pa would not be matched. uu-parsinglib will fail also, but checking a more steps - it will match whole string and fail, then throw away last "ab" pair and fail again, etc. If we have some nested rules and funny input text, it will make a huge difference.

A little benchmark

To prove, that the problem exists in real, consider following grammar (in a pseudocode - but I think it is very intuitive):

pattern = many("ab") + "a"
input   = many(pattern)

So as input to our program we get a string containing patterns, for example "abababaaba" contains 2 patterns "abababa" and "aba".

Lets make parsers in both libraries!

uu-parsinglib:

import Data.Char
import qualified Text.ParserCombinators.UU      as UU
import Text.ParserCombinators.UU                hiding(parse)
import Text.ParserCombinators.UU.BasicInstances hiding (Parser)

import System.TimeIt (timeIt)

pa = pSym 'a'
pb = pSym 'b'
pTest = pList $ pList_ng ((\x y -> [x,y]) <$> pa <*> pb) <* pa

main :: IO ()
main = do
    timeIt maininner
    return ()

maininner = do
    let (x,e) = UU.parse ((,) <$> pTest <*> pEnd) (createStr (LineColPos 0 0 0) (concat $ replicate 1000 (concat (replicate 1000 "ab") ++ "a")))
    print $ length x

Parsec:

import           Control.Applicative
import           Text.Parsec          hiding (many, optional, parse, (<|>))
import qualified Text.Parsec          as Parsec

import System.TimeIt (timeIt)

pa = char 'a'
pb = char 'b'
pTest = many $ many (try ((\x y -> [x,y]) <$> pa <*> pb)) <* pa

main :: IO ()
main = do
    timeIt maininner2
    return ()

maininner2 = do
    let Right x = Parsec.runParser pTest (0::Int) "test" $ (concat $ replicate 1000 (concat (replicate 1000 "ab") ++ "a"))
    print $ length x

Result? uu-parsinglib is > 300% slower:

uu-parsinglib - 3.19s
Parsec        - 1.04s

(compiled with ghc -O3 flag)

Meathead answered 30/12, 2013 at 19:50 Comment(10)
I know next to nothing about Parsec and since you're not explicit about the behavior you want to see, that doesn't make it easy to understand the question. If you want greedy parsing, why use the _ng variant? Could you clarify?Strohl
@Strohl I'm pretty sure that in uuparsinglib the greedy and non-greedy variants affect the order of parse results, but not which results you get -- and hence not which things are forced to stay in memory during backtracking.Amerce
@danilo2 Does your use case admit a separation into lexing and parsing steps? If so, that usually reduces the amount of backtracking needed significantly.Amerce
@DanielWagner uu-parsinglib only gives a single result unless you use the amb combinator to specify that you're parsing an ambiguous grammar. If I understand danilo2's example correctly, switching to pList should make uu-parsinglib behave similar to Parsec (that is, produce an error).Strohl
@DanielWagner: I've edited the question, please take a look. Switching to pList will not make simmilar behaviour to Parsec. If we use the greedy pList instead of pList_ng, Parsec would success on pExample2, but uu-parsinglib would fail.Meathead
@raymonad: I'm sorry If I was not clear enough. If we consider the Parsec version, try(pa <* pb) is treated as "single element" - if pa fails, the whole element fails. Combinator many is greedy. I want to get in uu-parsinglib exactly the same behaviour than in pExample2 in Parsec. (Please take a look at the updated question)Meathead
@DanielWagner: No, it does not admit a separation into lexing and parsing steps. Even in such case, uu-parsinglib is a big and stable library and should allow for tweaking the performance crutial elements, like Parsec does.Meathead
@danilo2 I don't think you're going to get exactly the same behavior (exact same memory usage). But did you actually observe this huge difference? It sounds like you're not completely clear on the inner workings of uu-parsinglib. The details have started to fade from my memory as well, but I may try to produce some sort of answer.Strohl
@raymonad: Of course I do not know uu-parsinglib internals as well, but I do not think we have to make a benchmarks to see one clear thing - If we focus on Example1 then we see, that uu-parsinglib is making a backtracking (when providing input of "ababab"), so it does also in Example2. I do not like the idea of writing 2 compilers in both libraries and benchmark them to prove one is slower than another (and I put a great emphasis on performance in case of this compiler). (and no, I will not use attoparsec :) )Meathead
@raymonad: I've maded a small benchmark - check out the updated question :)Meathead
H
10

To understand the subtleties it is important to understand the differences between the try construct in Haskell and the non-greedy parsing strategy used in uu-parsinglib. Effectively the latter is a try which just looks ahead one symbol. In that respect it is less powerful than the try construct from Parsec, in which you specify that a specific construct has to be present completely. And then there is the underlying different overall strategy. Parsec uses a back-tracking strategy with explicit tries to commit, whereas uu-parsinglib uses a breadth-first strategy with an occasional single symbol look-ahead.

So it does not come as a surprise that there is a time difference between the two. In the Parsec case it is decided that the tried alternative does apply after having seen the complete construct (two symbols), whereas the greedy uu-parsinglib decides that this must be the right alternative after having successfully seen the first symbol. And this conclusion may be unjustified.

If one moves to the breadth-first strategy uu-parsinglib uses one has to keep track of several alternatives at the same time, and this take time. Two alternative, twice the time, etc.

The advantage of Parsec is that you can tune a back-tracking parser by liberal use of try constructs and by placing alternatives in the right order, but you are also more likely to ask questions on mailing list about why your parsers do not work as expected. You are not so much writing a grammar as writing parser. The uu-parsinglib starts from the other end of the spectrum: we try to accept quite a large collection of grammars (and the parsers implied by them).

My feeling is also that in the presence of try constructs having excellent error-repairing parsers is much more complicated. Once a try construct fails it is not possible to go back there and decide that, with a small correction, it is much better than the alternatives that come after it.

Humanist answered 19/1, 2014 at 1:9 Comment(0)
S
2

The behavior you're describing (using pList_ng) does apply to other parsers (such as the simple list-of-successes method described in, for example, Jeroen Fokker's Functional Parsers) combinators), but not to uu-parsinglib. The library uses a breadth-first strategy to avoid space leaks (as a result of hanging on to the entire input, as would be the case when using a depth-first strategy). That is why I asked whether you had either created a test case or looked at the internals at all.

For a more technical description, see the paper in Text.ParserCombinators.UU.README (and maybe the source code after that). Here I'll use pExample2 to sketch the parsing process. Branching happens in pList_ng (in pTest), which uses <|> to recognize either the empty string or another element. Because pTest is followed by pa, instead of the empty string, the alternative to parsing another element is actually recognizing a single 'a'.

When we see the first 'a' in the input, both alternatives can successfully parse this character. Next, we see a 'b'. Now the alternative that parses only a single 'a' cannot make any further progress, so we drop that one. There is one alternative left: the one that recognizes (a list of) 'a' followed by 'b' (pTest). Next up is another 'a', and there are again two alternatives to consider. But then we see a 'b' and, again, we can immediately drop the second alternative. Then there is one last characer, an 'a', which once more means two alternatives. But now we get to the end of the string and only the alternative obtained by letting pa recognize the final a leads to a succesful parse.

Considering the alternative input "ababab", we see that the pa alternative fails again when we get to the final 'b', so only the pTest alternative remains. That one finishes because we get to the end and then pa (following pTest in pExample2) fails, so we get an error.

At any point, uu-parsinglib only needs to keep alternatives in memory that have not yet failed, and the breadth-first strategy ensures that all alternatives are evaluated in lockstep, so there is no need to hang on to the first 'a' and 'b' until the end of the string is reached and it does not match the whole string first before backtracking.

Edit:

From what I gather about Parsec, this is indeed less efficient, because Parsec does not consider pa until pTest finishes. Section 5.1 in the paper says a few words about something like try for uu-parsinglib, including some objections. Raw speed is not a primary goal and I once saw a presentation of some benchmarks where uu-parsinglib also did not come out on top, but overall it performed reasonably well. If speed is so important for this compiler you mentioned and if you don't need any extra features such as online results or error correction, maybe you should just stick to Parsec? (Or look for more comprehensive benchmarks first.)

There are clearly significant differences between the two libraries, so I'm not surprised to see that Parsec is faster in some cases, although the difference in this case is indeed pretty big. Maybe there is a way to tweak the uu-parsinglib version further (without changing internals as hinted at by that section about greedy parsing in the paper), but it's not obvious (to me anyway).

Well, one thing you could do is rewrite the grammar:

pTest' = pList $ pa *> pList ((\x y -> [x,y]) <$$> pb <*> pa)

For Parsec that would the following I think (but this seems to make it slower):

pTest' = many $ pa *> many (flip (\x y -> [x,y]) <$> pb <*> pa)

This helps, but not enough to beat the Parsec version. Using your benchmark, I get the following results:

 uu-parsinglib (pTest)  - 1.98s
 uu-parsinglib (pTest') - 1.11s
 Parsec (pTest)         - 0.59s
 Parsec (pTest')        - 0.67s
Strohl answered 30/12, 2013 at 21:50 Comment(4)
Thank you very much for this answer - I have read some papers couple months ago about uu-parsinglib internals, but I do not feel as expert in this field. I completely agree with your answer and I still see, that using try prevents us from making a lot of checks (maybe I'm wrong?). Anyway - I've pasted a benchmark (see the updated question), that shows I could be right.Meathead
@danilo2 I don't think you're wrong about that. I've updated my answer as well.Strohl
Hmm, now I'm surprised! First of all I should have think of better grammar to prevent from making such tricks, but I'm happy I did not, because you've shown something very interesting. Your code is all greedy and it is still 2 times slower than the Parsec one. I think It would be good to ask @DoaitseSwierstra about it. I'll send an email to uu-parsinglib mailing list - maybe he will say why we get such results :)Meathead
I just tried one additional change (but now I really should get some sleep): reverse <$> pToken "ba" (or string "ba" for Parsec) instead of (\...) <$> pb <*> pa and that gives me 0.64s for uu-parsinglib and 0.61s for Parsec. pToken is not implemented in terms of pSym/pSatisfy, so this produces one Step instead of two. ("ab" <$ pToken "ba" is probably a bit faster)Strohl

© 2022 - 2024 — McMap. All rights reserved.