How to ignore single line comments in a parser-combinator

I have a working parser, but I've just realised I do not cater for comments. In the DSL I am parsing, comments start with a ; character. If a ; is encountered, the rest of the line is ignored (so the whole line is ignored only when the ; is its first character).

I am extending RegexParsers for my parser and ignoring whitespace (the default way), so I am losing the new line characters anyway. I don't wish to modify each and every parser I have to cater for the possibility of comments either, because statements can span multiple lines (thus each part of each statement may end with a comment). Is there any clean way to achieve this?

Dnieper answered 4/1, 2014 at 19:7 Comment(0)
A
6

One thing that may influence your choice is whether comments can appear in the middle of input that one of your parsers would otherwise accept. For instance let's say you have something like:

val p = "(" ~> "[a-z]*".r <~ ")"

which would parse something like ( abc ) but because of comments you could actually encounter something like:

( ; comment goes here
  abc
)

Then I would recommend using TokenParsers or one of its subclasses. It's more work, because you have to provide a lexical parser that does a first pass to discard the comments. But it is also more flexible if you have nested comments, if the ; can be escaped, or if the ; can appear inside a string literal like:

abc = "; don't ignore this" ; ignore this
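
A minimal sketch of that token-based route, assuming the scala-parser-combinators library is on the classpath. SemicolonCommentLexical and AssignmentParser are made-up names for illustration, not part of the library. The sketch only replaces the whitespace rule of StdLexical so that a ; outside a string literal starts a comment running to the end of the line. Since the lexer reads a string literal as a single token, a ; inside quotes is left alone.

```scala
import scala.util.parsing.combinator.lexical.StdLexical
import scala.util.parsing.combinator.syntactical.StandardTokenParsers
import scala.util.parsing.input.CharArrayReader.EofCh

// Hypothetical lexer: StdLexical with its whitespace rule replaced, so that
// ';' up to (but not including) the end of the line counts as whitespace.
class SemicolonCommentLexical extends StdLexical {
  override def whitespace: Parser[Any] = rep[Any](
    whitespaceChar | ';' ~ rep(chrExcept(EofCh, '\n'))
  )
}

// Hypothetical grammar for input such as: abc = "some text"
object AssignmentParser extends StandardTokenParsers {
  override val lexical: StdLexical = new SemicolonCommentLexical
  lexical.delimiters += "="

  // An identifier, the "=" delimiter, then a string literal.
  def assignment: Parser[(String, String)] =
    ident ~ ("=" ~> stringLit) ^^ { case name ~ value => (name, value) }

  // Returns the parsed pair, or None if the input does not match.
  def apply(input: String): Option[(String, String)] =
    phrase(assignment)(new lexical.Scanner(input)) match {
      case Success(result, _) => Some(result)
      case _                  => None
    }
}
```

With this, the line above (abc = "; don't ignore this" ; ignore this) should come back as the pair of abc and the full string literal, with the trailing comment dropped by the lexer.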

On the other hand, you could also try to override the value of whitespace to be something like

override protected val whiteSpace = """(\s|;.*)+""".r

Or something along those lines. For instance using the example from the RegexParsers scaladoc:

import scala.util.parsing.combinator.RegexParsers

object so1 {
  Calculator("""(1 + ; foo
  (1 + 2))
  ; bar""")
}

object Calculator extends RegexParsers {
  override protected val whiteSpace = """(\s|;.*)+""".r
  def number: Parser[Double] = """\d+(\.\d*)?""".r ^^ { _.toDouble }
  def factor: Parser[Double] = number | "(" ~> expr <~ ")"
  def term: Parser[Double] = factor ~ rep("*" ~ factor | "/" ~ factor) ^^ {
    case number ~ list => (number /: list) {
      case (x, "*" ~ y) => x * y
      case (x, "/" ~ y) => x / y
    }
  }
  def expr: Parser[Double] = term ~ rep("+" ~ log(term)("Plus term") | "-" ~ log(term)("Minus term")) ^^ {
    case number ~ list => list.foldLeft(number) { // same as before, using alternate name for /:
      case (x, "+" ~ y) => x + y
      case (x, "-" ~ y) => x - y
    }
  }
  def apply(input: String): Double = parseAll(expr, input) match {
    case Success(result, _) => result
    case failure: NoSuccess => scala.sys.error(failure.msg)
  }
}

This prints:

Plus term --> [2.9] parsed: 2.0
Plus term --> [2.10] parsed: 3.0
res0: Double = 4.0
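
A small sketch of why the trailing + in that regex matters, using only the standard library. As far as I can tell from the library source, RegexParsers skips whatever prefix whiteSpace matches at the current position (via findPrefixOf), so a pattern that can match only one comment leaves the next comment line in front of your parser. WhiteSpaceRegexDemo and its value names are made up for the demo.

```scala
object WhiteSpaceRegexDemo {
  // Two consecutive comment lines followed by real input.
  val input = "; one\n; two\nabc"

  // One comment line OR one run of whitespace, with no repetition.
  val single = """(;.*\n)|(\s+)""".r
  // Any mix of whitespace and comments, repeated.
  val repeated = """(\s|;.*)+""".r
  // '.' does not match a newline by default, so this stops at end of line.
  val commentOnly = """;.*""".r

  // Length of the prefix each pattern would skip at the start of `input`.
  def skipped(r: scala.util.matching.Regex): Int =
    r.findPrefixOf(input).map(_.length).getOrElse(0)

  def main(args: Array[String]): Unit = {
    println(skipped(single))      // 6: only "; one\n" is skipped
    println(skipped(repeated))    // 12: both comment lines are skipped
    println(skipped(commentOnly)) // 5: "; one" without the newline
  }
}
```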
Aircraftman answered 4/1, 2014 at 19:22 Comment(14)
I've played with the parser combinator library in Scala quite a bit, and I'd recommend using a TokenParser for anything besides the most trivial parsers. First, tokenizing up front makes the parser both faster and simpler. Second, without a separate tokenizing phase it's very hard to get the parser right. I had all sorts of issues with trying to differentiate keywords and identifiers. E.g., I found that my parser would treat if x then y else z and ifxthenyelsez identically unless I added a bunch of negative lookahead stuff to my regexes.Yost
Thanks for your comments. The whitespace override seems interesting. Now that I have completed my parser using RegexParsers I don't wish to change everything again to use TokenParser, although I don't know the difference yet.Dnieper
@Yost The reason you are getting ifxthenyelsez is probably because of the way your regular expressions are structured. I had a similar problem, you might want to check this: #20793558Dnieper
@Aircraftman I tried the whitespace override approach and it doesn't seem to work. I also tried it in this way: override protected val whiteSpace = """(;.*\n)|(\s+)""".r so that hopefully it catches the comments greedily first before the whitespace removes the ending \n. No idea what's happening though, it still complains on a line which starts with a ; although the first line in the file which is a comment is skipped.Dnieper
@jbx, see my example with a slightly different regex for whiteSpaceAircraftman
@Dnieper - I understand why I get that behavior—the issue is that modifying my regexes to fix it is a lot more complicated and error-prone than just using a token parser.Yost
@Aircraftman I had some improvement. What I still have though is that if I have 2 comment lines, one after the other, it fails. If I have one comment line between 2 lines which are processed by my parser it works fine.Dnieper
@Aircraftman You will also get the same problem if after the line Calculator("""(1 + ; foo and before the line (1 + 2)) you add a line ;bazDnieper
@Aircraftman Seems that if I modify the regular expression to this it works fine to cater for multiple comment lines after each other: """\s*((;.*)?\s*)*""".rDnieper
@Dnieper - """(\s|;.*)+""".r should work too, and it's probably a bit easier to read. The problem with the original regex is that it only matches one comment. Both yours and the one I just provided allow multiple comments. One thing to note is that . in a regex does not match a newline by default (although I'm pretty sure there's an option you can enable to change that), but \s does match a newline.Yost
@jbx, it is unfortunate that the scaladoc does not have an example of TokenParsers with Lexical as it is really a more principled approach to handle comments than doing it by trial and error with regex. I had to convert a RegexParsers of mine once (for which I don't have the code handy now) and it's really straightforward once you figure out how to hook up the Lexical class into it. So if the whiteSpace works for you, then great, but if you keep running into corner cases it may be worth it to convert it to a TokenParsers.Aircraftman
@Yost Thanks, I will edit the answer with your regex so that other people looking for a similar answer will find the correct one in the code example provided.Dnieper
@Aircraftman Yes, unfortunately one of the drawbacks with Scala is that there are many alternatives to achieve different flavours of the same thing, and you only find one or two examples of each. And the syntax with all the cryptic operators doesn't help either. Thanks a lot for your suggestions. It's working fine now, marking your response as the answer.Dnieper
@Aircraftman man, that whitespace trick is a clever shortcut I would never have thought of. Going TokenParser is certainly the long term approach, but the whitespace override got my parser working for the POC. Thanks!Deconsecrate
V
0

Just filter out all the comments with a regex before you pass the code into your parser.

def removeComments(input: String): String = {
  """(?ms)\".*?\"|;.*?$|.+?""".r.findAllIn(input).map(str => if(str.startsWith(";")) "" else str).mkString
}

val code =
"""abc "def; ghij"
abc ;this is a comment
def"""

println(removeComments(code))
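
For reference, a self-contained version of the snippet above that can be run as is. RemoveCommentsDemo is just a wrapper name made up for the demo, and the line-count check at the end is an addition, not part of the original snippet. On this sample the comment is dropped while the ; inside the quoted string is kept. The $ anchor is zero-width, so the newline after the comment stays in place.

```scala
object RemoveCommentsDemo {
  // Same regex as above: a quoted string, a ';' comment up to the end of
  // the line, or any single character, tried in that order.
  def removeComments(input: String): String =
    """(?ms)\".*?\"|;.*?$|.+?""".r
      .findAllIn(input)
      .map(str => if (str.startsWith(";")) "" else str)
      .mkString

  def main(args: Array[String]): Unit = {
    val code = "abc \"def; ghij\"\nabc ;this is a comment\ndef"
    val cleaned = removeComments(code)
    println(cleaned)
    // Prints three lines: abc "def; ghij" / abc (with a trailing space) / def
    println(cleaned.split("\n").length == code.split("\n").length) // true
  }
}
```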
Vaccaro answered 4/1, 2014 at 23:5 Comment(2)
Yes, but this won't work if I am using a Reader and will also interfere with line number positioning if I am using Positional output to know where parsing failed.Dnieper
I don't think using lazy quantifiers is a good idea here. That might make the regex match less than the rest of the line.Yost
