Scala parser combinators and newline-delimited text

I am writing a Scala parser combinator grammar that reads newline-delimited word lists, where lists are separated by one or more blank lines. Given the following string:

cat
mouse
horse

apple
orange
pear

I would like to have it return List(List(cat, mouse, horse), List(apple, orange, pear)).

I wrote this basic grammar which treats word lists as newline-delimited words. Note that I had to override the default definition of whitespace.

import util.parsing.combinator.RegexParsers

object WordList extends RegexParsers {

    private val eol = sys.props("line.separator")

    override val whiteSpace = """[ \t]+""".r

    val list: Parser[List[String]] = repsep( """\w+""".r, eol)

    val lists: Parser[List[List[String]]] = repsep(list, eol)

    def main(args: Array[String]): Unit = {
        val s =
          """cat
            |mouse
            |horse
            |
            |apple
            |orange
            |pear""".stripMargin

        println(parseAll(lists, s))
    }
}

This incorrectly treats blank lines as empty word lists, i.e. it returns

[8.1] parsed: List(List(cat, mouse, horse), List(), List(apple, orange, pear))

(Note the empty list in the middle.)

I can put an optional end of line at the end of each list.

val list: Parser[List[String]] = repsep( """\w+""".r, eol) <~ opt(eol)

This handles the case where there is a single blank line between lists, but has the same problem with multiple blank lines.

I tried changing the lists definition to allow multiple end-of-line delimiters:

val lists:Parser[List[List[String]]] = repsep(list, rep(eol))

but this hangs on the above input.

What is the correct grammar that will handle multiple blank lines as delimiters?

Whiting answered 13/11, 2012 at 3:35

Try setting skipWhitespace to false instead of overriding whiteSpace. The empty list appears because repsep does not consume the line break at the end of each list: the separator between lists matches that line break, and the blank line that follows is then parsed as an empty list. Instead, parse the line break (or end of input) after each word:

import util.parsing.combinator.RegexParsers

object WordList extends RegexParsers {

  private val eoi = """\z""".r // end of input
  private val eol = sys.props("line.separator")
  private val separator = eoi | eol
  private val word = """\w+""".r

  override val skipWhitespace = false

  val list: Parser[List[String]] = rep(word <~ separator)

  val lists: Parser[List[List[String]]] = repsep(list, rep1(eol))

  def main(args: Array[String]): Unit = {
    val s =
      """cat
        |mouse
        |horse
        |
        |apple
        |orange
        |pear""".stripMargin

    println(parseAll(lists, s))
  }

}
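
One caveat with the grammar above: sys.props("line.separator") is the platform's separator, which need not match the line breaks actually present in the input (for example "\r\n" from the platform against "\n" in the text). What follows is a minimal sketch, assuming scala-parser-combinators is on the classpath, that matches line breaks with a regex instead and also tolerates leading and trailing blank lines. The names WordListsPortable and blankLines are made up for illustration and are not part of the original answer.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical variant of the grammar above; names are illustrative only.
object WordListsPortable extends RegexParsers {

  // Newlines are significant here, so turn off automatic whitespace skipping.
  override val skipWhitespace = false

  // Match "\n" or "\r\n" regardless of the platform's line.separator.
  private val eol: Parser[String] = """\r?\n""".r
  private val word: Parser[String] = """\w+""".r

  // One or more words, one per line; never produces an empty list.
  val list: Parser[List[String]] = rep1sep(word, eol)

  // Lists are separated by the line break that ends the last word
  // plus at least one blank line.
  private val blankLines: Parser[Any] = eol ~ rep1(eol)

  // Leading and trailing line breaks are skipped.
  val lists: Parser[List[List[String]]] =
    rep(eol) ~> repsep(list, blankLines) <~ rep(eol)

  def main(args: Array[String]): Unit = {
    val s = "cat\nmouse\nhorse\n\n\napple\norange\npear\n"
    println(parseAll(lists, s))
  }
}
```

Because list uses rep1sep it cannot match an empty string, so any number of blank lines between lists is absorbed by the separator rather than showing up as empty lists.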

Then again, parser combinators are a bit overkill here. You could get practically the same thing (but with Arrays instead of Lists) with something much simpler:

s.split("\n{2,}").map(_.split("\n"))
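
If Lists are wanted rather than Arrays, the same split-based idea can be wrapped in a small helper. This is only a sketch using the standard library: the object name SplitWordLists and the method parse are hypothetical, and accepting "\r\n" as well as "\n" is an added assumption beyond the one-liner above.

```scala
object SplitWordLists {

  // Hypothetical helper, not part of the original answer.
  def parse(s: String): List[List[String]] =
    s.split("(?:\\r?\\n){2,}")   // two or more consecutive line breaks end a list
      .toList
      .map(_.split("\\r?\\n").toList.filter(_.nonEmpty)) // one word per line
      .filter(_.nonEmpty)        // drop empty chunks caused by leading blank lines

  def main(args: Array[String]): Unit = {
    val s = "cat\nmouse\nhorse\n\n\napple\norange\npear"
    println(parse(s)) // prints List(List(cat, mouse, horse), List(apple, orange, pear))
  }
}
```
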
Adumbral answered 13/11, 2012 at 4:32 Comment(4)
That works if there is only one blank line between word lists. If there are n blank lines we end up with n-1 bogus empty lists in the middle. (BTW: the skipWhitespace and eoi examples are very helpful.) - Whiting
@W.P.McNeill - I updated the code to look for rep1(eol) between lists of strings. Is that what you were going for? - Adumbral
rep1(eol) is what I was looking for. Thanks. I know that parser combinators are overkill here. I deliberately simplified the problem for the purposes of exposition. - Whiting
In that case, +1 for deliberately simplifying the problem for purposes of exposition! - Adumbral