For a small compiler project, we are implementing a compiler for a subset of C in Haskell with megaparsec. Overall we have made good progress, but there are still some corner cases that we cannot handle correctly yet. One of them is the treatment of a backslash followed by a newline. To quote from the specification:
Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. (§5.1.1., ISO/IEC 9899:201x)
So far we have come up with two possible approaches to this problem:
1.) Implement a pre-lexing phase in which the initial input is reproduced and every occurrence of \\\n is removed. The big disadvantage we see in this approach is that we lose the accurate error locations that we need.
2.) Implement a special char' combinator that behaves like char but looks an extra character ahead and silently consumes any \\\n. This would give us correct positions. The disadvantage here is that we need to replace every occurrence of char with char' in any parser, even in the megaparsec-provided ones like string, integer, whitespace etc.
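To make the second approach concrete, here is a rough sketch of what we have in mind. It assumes a recent megaparsec with String input; the names splices, string' and keywordThenPos are made up for this example, and char' here only drops splices that come after the matched character, so it is an illustration rather than a complete solution.

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (char, string)

type Parser = Parsec Void String

-- Skip zero or more backslash-newline splices.
-- 'string' backtracks on failure, so a lone backslash is left untouched.
splices :: Parser ()
splices = skipMany (string "\\\n")

-- Hypothetical char': match one character, then drop any splices after it.
char' :: Char -> Parser Char
char' c = char c <* splices

-- Hypothetical string', built from char' so a splice may occur mid-token.
string' :: String -> Parser String
string' = traverse char'

-- Parse the keyword "int" and report the source position right after it.
keywordThenPos :: Parser (String, SourcePos)
keywordThenPos = (,) <$> string' "int" <*> getSourcePos

main :: IO ()
main = case parse keywordThenPos "<input>" "in\\\nt x" of
  Left err -> putStr (errorBundlePretty err)
  Right (kw, pos) ->
    print (kw, unPos (sourceLine pos), unPos (sourceColumn pos))
```

Because the splice is consumed through the ordinary parser machinery, the reported position stays a physical one. The cost is exactly the one described above: string' has to be rebuilt on top of char', and so would every other library combinator.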
Most likely we are not the first people trying to parse a language with such a "quirk" with parsec/megaparsec, so I could imagine that there is some nicer way to do it. Does anyone have an idea?
lexeme combinator which first executes another parser and then consumes all whitespace after the token. Note that comments can only appear between tokens, hence handling it in lexeme is sufficient. In contrast, \\\n can appear anywhere, even inside a string literal or a decimal etc... – Berger
#line directives too. So I think maybe you'll need to pair your chars and identifiers up with some kind of LocationSpan type. – Ignominy
ParsecT in a newtype and write a modified MonadParsec instance for it, one in which functions like token and tokens skipped line splices. That way you wouldn't need to modify "derived" combinators like string. – Handicapped
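The idea of pairing characters with locations could be combined with the pre-lexing approach from the question. Below is a minimal sketch using only base; the Located type and the annotate and splice functions are hypothetical names for this example and not part of megaparsec. Feeding such a list to megaparsec would presumably need a custom input stream type, which is not shown here.

```haskell
-- A character together with the physical position it came from (1-based).
data Located = Located
  { locLine :: Int
  , locCol  :: Int
  , locChar :: Char
  } deriving (Eq, Show)

-- Tag every character of the input with its physical line and column.
annotate :: String -> [Located]
annotate = go 1 1
  where
    go _ _ [] = []
    go l c (x : xs)
      | x == '\n' = Located l c x : go (l + 1) 1 xs
      | otherwise = Located l c x : go l (c + 1) xs

-- Delete each backslash that is immediately followed by a newline,
-- together with that newline; all other characters keep their tags.
splice :: [Located] -> [Located]
splice (Located _ _ '\\' : Located _ _ '\n' : rest) = splice rest
splice (x : rest) = x : splice rest
splice [] = []

main :: IO ()
main = mapM_ print (splice (annotate "in\\\nt x"))
```

In this sketch the splice is removed before parsing, yet the t that follows it still carries line 2, column 1, so error messages could refer to physical source positions.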