Elegant way to parse "line splices" (backslashes followed by a newline) in megaparsec

For a small compiler project, we are implementing a compiler for a subset of C, and we decided to use Haskell and megaparsec. Overall we have made good progress, but there are still some corner cases that we cannot handle correctly yet. One of them is the treatment of backslashes followed by a newline. To quote from the specification:

Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. (§5.1.1., ISO/IEC9899:201x)

So far we came up with two possible approaches to this problem:

1.) Implement a pre-lexing phase in which the initial input is reproduced and every occurrence of \\\n is removed. The big disadvantage we see in this approach is that we lose the accurate error locations, which we need.

2.) Implement a special char' combinator that behaves like char but looks an extra character ahead and silently consumes any \\\n. This would give us correct positions. The disadvantage here is that we need to replace every occurrence of char with char' in every parser, even in the megaparsec-provided ones like string, integer, whitespace, etc.
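
One way the first approach might keep its error locations is to tag each character with its physical position before the splices are removed, so that the surviving characters still know where they came from. The following is only a sketch using plain base Haskell: the names Located, locate and unsplice are hypothetical and not part of megaparsec, and it assumes '\n' line endings and counts every character (including tabs) as one column. Feeding such a list to megaparsec would additionally require a custom stream type, which is not shown here.

```haskell
module Main where

-- | A character tagged with its 1-based (line, column) in the original input.
-- Hypothetical type for illustration; not provided by megaparsec.
data Located = Located
  { locLine :: !Int
  , locCol  :: !Int
  , locChar :: !Char
  } deriving (Eq, Show)

-- | Tag every character with its physical position.
-- Assumes '\n' line endings; every other character advances the column by one.
locate :: String -> [Located]
locate = go 1 1
  where
    go _ _ [] = []
    go l c (x:xs)
      | x == '\n' = Located l c x : go (l + 1) 1 xs
      | otherwise = Located l c x : go l (c + 1) xs

-- | Drop every backslash that is immediately followed by a newline,
-- together with that newline. Surviving characters keep their positions.
unsplice :: [Located] -> [Located]
unsplice (Located _ _ '\\' : Located _ _ '\n' : rest) = unsplice rest
unsplice (x : rest) = x : unsplice rest
unsplice [] = []

main :: IO ()
main = mapM_ print (unsplice (locate "in\\\nt x;"))
```

Here the keyword int is split across two physical lines. After unsplice the characters read "int x;", while the 't' still reports line 2, column 1, so an error raised at that token can point at the original source.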

Most likely we are not the first people trying to parse a language with such a "quirk" with parsec/megaparsec, so I could imagine that there is some nicer way to do it. Does anyone have an idea?

Berger answered 2/11, 2017 at 21:58 Comment(5)
Isn't this basically the same problem you're going to have with comments? There are characters in the input stream that you wish to pretend don't exist, but still need to be tracked for accurate line and column numbers. How do you handle comments? Can you handle these escape sequences in the same way? – Crossindex
@amalloy: It's not exactly the same. Handling comments happens inside our lexeme combinator, which first executes another parser and then consumes all whitespace after the token. Note that comments can only appear between tokens, hence handling them in lexeme is sufficient. In contrast, \\\n can appear anywhere, even inside a string literal or a decimal, etc. – Berger
The C preprocessor is annoyingly resilient against functional refactoring. Keep in mind you'll probably eventually want to handle #line directives too. So I think maybe you'll need to pair your chars and identifiers up with some kind of LocationSpan type. – Ignominy
Perhaps you could wrap ParsecT in a newtype and write a modified MonadParsec instance for it, one in which functions like token and tokens skip line splices. That way you wouldn't need to modify "derived" combinators like string. – Handicapped
@danidiaz, We haven't implemented it, as our instructor told us to skip this part of the specification, but I think this is the cleanest way to do it. Thank you for the suggestion! If you add this as an answer I will accept it. – Berger
