Combining lexer and parser in a parser combinator
Asked Answered
I'm using uu-parsinglib, but I think the following question is parser combinator generic.

Let's consider the following example:

I've got a lexer with a combinator pLex, which produces a list of tokens (of type MyToken). I now want to write a parser, which will consume the tokens and build an AST.

What is the best way to connect the lexer and parser? Right now I have a lex function:

lex s = parse ( (,) <$> pLex <*> pEnd) (createStr (LineColPos 0 0 0) s)

Should I create a function parse p = ...? If so, how do I construct it so that it keeps track of the lexer's line and column positions? Or should I create a parser combinator that uses the pLex combinator somehow?

Osteotomy answered 13/8, 2013 at 16:17 Comment(9)
On my phone, so I'll just point for now: in the BasicInstances module there's the Str type and createStr which can be used to create token streams.Rusticate
Other parser combinator libraries also sometimes subsume lexing. For this, see the Token Parsing modules in Parsec or parsers. Generally that's the direction you go when using parser combinators. But you can always just feed the output of your lexer, i.e. [MyToken], in as input to a token parser.Rusticate
@J.Abrahamson: I've got a pretty big grammar, so I don't know if subsuming the lexer into the parser is a good idea (but maybe with parser combinators this is the way to go?). I've got a problem with this because, for example, my combinator "pString" returns a list of tokens (because in my language strings can have variables, like "test: $name"). So generally I've written some "end" combinators, which result in lists of tokens, and then I'm summing them together. Is this a bad approach?Osteotomy
@J.Abrahamson: I would be very thankful if you'll tell me how you'll design the parser if you'll have some combinators, which have to return a list of tokens (like the string above, or pNewline, which should return Newline and Indent Int (indentation of new line) tokens). The tokenstream may work, but maybe there is a better approach?Osteotomy
You could fire off subparsers on each tokenstream generated during your main parse. So if tokenizeSegment :: P String [MyToken] and parseSegment :: P [MyToken] SegmentAST then do something like parse parseSegment <$> tokenizeSegment :: P String (Either ParseError SegmentAST). You can also handle that error by deconstructing the Either and passing the ParseError up to the main parser.Rusticate
@J.Abrahamson: hmmm... But as far as I understand, it is not always possible to parse the result of tokenizeSegment. For example, the result could be [Newline, Indentation 2] (Indentation meaning the indentation level), and you cannot convert that to any AST structure unless you know the next segment?Osteotomy
I have an idea, which could solve it - my pToken combinator could always return one token if I can check a state inside of it (for example after newline I'll switch to a newline state etc) - but then I have to somehow convert uu-parsinglib monads to support state.Osteotomy
To make a normal string parser via parser combinators, you would write a string parser that says, 'a string is a quote mark followed by zero or more non-quote characters and terminated by another quote character`. For a string with splices, it would be 'zero or more (non-quote char or splice)' and then you'd provide a splice parser as a separate function, which says 'a splice is a dollar sign followed by...' etc.Plumbing
See Graham Hutton's paper 'Higher Order Functions for Parsing' for the definition of a parsing combinator that handles the 'offside rule', which is what Haskell calls its indentation-sensitive syntax rule. Since uu-parsinglib stores the current input position in its state, you should be able to attach the row/col position of the beginning of each token in the return value of each token parser and implement a combinator based on that idea directly.Plumbing
Table-based parsers require separation of lexical analysis and parsing because of their limited lookahead capability. Looking ahead far enough to combine lexical analysis into the parser would explode the state space.

Combinator-based approaches do not usually suffer this problem, as they are typically doing recursive-descent parsing. Unless otherwise noted by the library author, there is no harm in combining the phases and not much to gain by separating them.
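To illustrate combining the phases in a recursive-descent style, here is a minimal self-contained sketch (plain Haskell with no uu-parsinglib, and all names such as `P`, `lexeme`, `number`, and `addExpr` invented for this example): the same pass that skips whitespace at the lexical level also builds the semantic value, so no separate token list is ever produced.

```haskell
import Data.Char (isDigit, isSpace)

-- A toy combinator type: consume input, maybe yielding a value and the rest.
newtype P a = P { runP :: String -> Maybe (a, String) }

instance Functor P where
  fmap f (P p) = P $ \s -> fmap (\(a, r) -> (f a, r)) (p s)

instance Applicative P where
  pure a = P $ \s -> Just (a, s)
  P pf <*> P pa = P $ \s -> case pf s of
    Nothing      -> Nothing
    Just (f, s') -> fmap (\(a, r) -> (f a, r)) (pa s')

-- Lexical level: a token parser that eats trailing whitespace itself,
-- so no separate tokenizing pass is needed.
lexeme :: P a -> P a
lexeme (P p) = P $ \s -> case p s of
  Just (a, r) -> Just (a, dropWhile isSpace r)
  Nothing     -> Nothing

number :: P Int
number = lexeme $ P $ \s -> case span isDigit s of
  ("", _) -> Nothing
  (ds, r) -> Just (read ds, r)

symbol :: Char -> P Char
symbol c = lexeme $ P $ \s -> case s of
  (x:r) | x == c -> Just (x, r)
  _              -> Nothing

-- Syntactic level, written directly over the lexeme-level combinators:
addExpr :: P Int
addExpr = (+) <$> number <*> (symbol '+' *> number)

main :: IO ()
main = print (runP addExpr "1 + 2")
```

The point is that `number` and `symbol` play the role of the lexer, yet live inside the same combinator type as `addExpr`, so the grammar and the lexical structure compose freely.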

Although uu-parsinglib provides the Str class to abstract over different string-like inputs, looking at its definition shows that it still assumes that you are ultimately reading a sequence of Char, whether they be from a String, ByteString, Text, etc. So trying to get it to parse a MyToken stream seems like it could be difficult. Parsec might be a better choice if you feel you need to do that.

As to your question about your string implementation, combinators take a string-like input containing syntactic structure and return the corresponding semantic value, if they match. Inside the combinator, you get to build that semantic value from what you parse directly by taking from the input stream and by combining the semantic values from sub-combinators you call.

So, your 'String matching' combinator in your example will have a list of tokens in its scope thanks to the parsing it did. You can use the full power of Haskell to combine those tokens into a single MyString value in whatever way makes sense for your language: perhaps a 'SplicedString' type that represents which values are to be spliced into it.

The string combinator was probably called by an 'expression' combinator, which will be able to combine the MyString value with other parsed values into a MyExpression value. It's combinators returning semantic values all the way back up!

Plumbing answered 13/8, 2013 at 20:46 Comment(5)
Thank you for your answer - it seems that maybe I should not separate the parsing step into lexing and parsing. I always felt that creating a "lexer/tokenizer" and a "parser" was very "pure", but now I'm slowly changing my mind :)Osteotomy
It still might make sense to compartmentalize them somewhat, putting your token-matching combinators in one module and your parsing combinators in another. But an interesting thing about building your parser down to the lexeme level is that you can do lexical analysis differently depending on the parsing context. For example, your special strings could use a different set of combinators for lexical analysis of splicing inside strings.Plumbing
Thank you :) Do you know (by the way) if there is a simple way to create "a state" in uu-parsinglib? I mean - to create a combinator, which could use put and get from StateMonad?Osteotomy
I answered this question in the other post you did, as well as a hint towards how you could make a Token parser, since the two are related. If you found my answers helpful, don't forget to upvote/mark as answered. Thanks!Plumbing
Thank you :) I never forget to upvote / mark as answered :) I sometimes simply want to read the answer, check if I can make progress with it, and ask further if something appears unclear :) Thank you once again for the help! :)Osteotomy
I think there is nothing in uu-parsinglib that prevents you from using an input other than Text. It is just that for Text (and friends) we have provided quite a few functions you are likely to need. If you look at the older uulib parser combinators you will find a scanner-based approach, which can be used just as well with the newer uu-parsinglib.

If you want to process a lot of data, it may be better to have a separate scanning phase; error messages tend to be more informative that way. In uulib you will find some support for writing your scanner. Most languages place special restrictions or requirements on the lexical structure (e.g. the offside rule) that many tools will fail on, or need to be adapted for, when creating your scanner.
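As a sketch of that separate-scanning-phase design (plain Haskell, not uu-parsinglib's actual API; `MyToken`, `TokP`, and the helpers below are all invented for illustration): the scanner produces a [MyToken] list, and the parser's input type is that token list rather than Char.

```haskell
-- Tokens a hypothetical scanner might emit for an offside-rule language.
data MyToken = TNum Int | TPlus | TNewline | TIndent Int
  deriving (Eq, Show)

-- A toy parser over a token stream instead of characters.
newtype TokP a = TokP { runTokP :: [MyToken] -> Maybe (a, [MyToken]) }

-- Match one token via a projection function, returning its semantic value.
tok :: (MyToken -> Maybe a) -> TokP a
tok f = TokP $ \ts -> case ts of
  (t:rest) -> fmap (\a -> (a, rest)) (f t)
  []       -> Nothing

num :: TokP Int
num = tok $ \t -> case t of { TNum n -> Just n; _ -> Nothing }

plus :: TokP ()
plus = tok $ \t -> case t of { TPlus -> Just (); _ -> Nothing }

-- Parse "n + m" over tokens, sequenced by hand in the Maybe monad:
addTok :: TokP Int
addTok = TokP $ \ts -> do
  (n, ts1) <- runTokP num ts
  (_, ts2) <- runTokP plus ts1
  (m, ts3) <- runTokP num ts2
  Just (n + m, ts3)

main :: IO ()
main = print (runTokP addTok [TNum 1, TPlus, TNum 2])
```

Because the parser sees whole tokens, position-dependent lexical details (indentation, newlines) can be resolved once in the scanner and carried as data like TIndent, which is one reason error messages from this design tend to be clearer.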

Wylen answered 15/8, 2013 at 8:23 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.