How to skip whitespace but use it as a token delimiter in a parser combinator

I am trying to build a small parser where the tokens (luckily) never contain whitespace. Whitespace (spaces, tabs and newlines) is essentially the token delimiter (apart from cases where there are brackets etc.).

I am extending the RegexParsers class. If I turn on skipWhitespace, the parser greedily joins tokens together when the next token matches the regular expression of the previous one. If I turn off skipWhitespace, on the other hand, it complains because the spaces are not part of the definition. I am trying to match the BNF as closely as possible, and given that whitespace is almost always the delimiter (apart from brackets or some other cases where the delimiter is explicitly defined in the BNF), is there a way to avoid putting a whitespace regex in all my definitions?

UPDATE

This is a small test example where the tokens are being joined together:

import scala.util.parsing.combinator.RegexParsers

object TestParser extends RegexParsers {
  def test  = "(test" ~> name <~ ")"

  def name : Parser[String] = (letter ~ (anyChar*)) ^^ { case first ~ rest => (first :: rest).mkString}

  def anyChar = letter | digit | "_".r | "-".r
  def letter = """[a-zA-Z]""".r
  def digit = """\d""".r

  def main(args: Array[String]): Unit = {

    val s = "(test hello these should not be joined and I should get an error)"

    val res = parseAll(test, s)
    res match {
      case Success(r, n) => println(r)
      case Failure(msg, n) => println(msg)
      case Error(msg, n) => println(msg)
    }

  }

}

In the above case I just get the words joined together into one string. Something similar happens if I change test to the following: I expect a list of the separate words after test, but instead I get a one-element list containing a single long string with the spaces removed:

def test  = "(test" ~> (name+) <~ ")"
Maneater answered 27/12, 2013 at 0:21 Comment(2)
IIRC, skipping whitespace is only done at the start of each token, up until the first non-whitespace character is found. That is at odds with what you say is happening, so could you please provide sample code and test case? – Sealy
@DanielC.Sobral Added a small example that shows it happening. – Maneater

Whitespace is skipped just before every regex or string-literal parser, each of which RegexParsers treats as a token, not once per grammar-level token. So, in this snippet:

def name : Parser[String] = (letter ~ (anyChar*)) ^^ { case first ~ rest => (first :: rest).mkString}

It will skip whitespace before each letter and, even worse, each empty string for good measure (since anyChar* can be empty).
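
To make the difference concrete, here is a minimal sketch, assuming the scala-parser-combinators library is on the classpath and skipWhitespace is left at its default. The object WhitespaceDemo and the parsers charByChar and wholeToken are hypothetical names introduced only for illustration; they are not part of the question's code.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical demo object (not from the original post) showing where
// RegexParsers skips whitespace when skipWhitespace keeps its default (true).
object WhitespaceDemo extends RegexParsers {
  def letter: Parser[String] = """[a-zA-Z]""".r

  // Built from one-character parsers: whitespace is skipped before each one,
  // so the space between words is silently consumed.
  def charByChar: Parser[String] = rep1(letter) ^^ (_.mkString)

  // Built from a single regex: whitespace is skipped only once, before the token.
  def wholeToken: Parser[String] = """[a-zA-Z]+""".r

  def main(args: Array[String]): Unit = {
    println(parseAll(charByChar, "ab cd"))       // succeeds, the two words are joined
    println(parseAll(wholeToken, "ab cd"))       // fails, "cd" is left unconsumed
    println(parseAll(rep1(wholeToken), "ab cd")) // succeeds with two separate tokens
  }
}
```

The first call reproduces the joining behaviour from the question; the other two show that a regex covering the whole token keeps the space acting as a delimiter.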

Use regular expressions (or plain strings) for each token, not each lexical element. Like this:

object TestParser extends RegexParsers {
  def test  = "(test" ~> name <~ ")"
  def name : Parser[String] = """[a-zA-Z][a-zA-Z0-9_-]*""".r

  // ...
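
For reference, one way the elided part might be filled in is sketched below. This is only an illustration under assumptions: the object name TokenParser, the use of rep1 for the multi-word variant, and the sample input are all hypothetical and not taken from the answer.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical completion of the idea above; TokenParser and the sample
// input are illustrative assumptions, not part of the original answer.
object TokenParser extends RegexParsers {
  // One regex per token: whitespace is skipped once, before the whole name.
  def name: Parser[String] = """[a-zA-Z][a-zA-Z0-9_-]*""".r

  // One or more names between "(test" and ")".
  def test: Parser[List[String]] = "(test" ~> rep1(name) <~ ")"

  def main(args: Array[String]): Unit = {
    parseAll(test, "(test hello these stay separate)") match {
      case Success(words, _) => println(words)
      case Failure(msg, _)   => println(msg)
      case Error(msg, _)     => println(msg)
    }
  }
}
```

With name defined as a single regex, each word comes back as its own list element, and the whitespace between words works as a delimiter without appearing in any rule.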
Sealy answered 27/12, 2013 at 3:0 Comment(3)
Not sure I got the empty string part. So is it because I have a separate anyChar rule? So before that the spaces would be skipped once again, but the anyChar would still need to be combined in the higher-level name parser? So the small anyChars are gobbling up more tokens one by one? – Maneater
@Maneater Yes. The empty string is a separate issue that is not really relevant to the problem at hand. Spaces are skipped before each parser, and not only anyChar is a parser: even letter and digit are implicitly converted into parsers. – Sealy
You cannot place a semantic action onto the regex. The regex produces a plain, unstructured string. – Radcliff
