Scala, Parser Combinator for Tree Structured Data

N

2

6

How can parsers be used to parse records that spans multiple lines? I need to parse tree data (and eventually transform it to a tree data structure). I'm getting a difficult-to-trace parse error in the code below, but its not clear if this is even the best approach with Scala parsers. The question is really more about the problem solving approach rather than debugging existing code.

The EBNF-ish grammer is:

SP          = " "
CRLF        = "\r\n"
level       = "0" | "1" | "2" | "3"
varName     = {alphanum}
varValue    = {alphnum}
recordBegin = "0", varName
recordItem  = level, varName, [varValue]
record      = recordBegin, {recordItem}
file        = {record}

An attempt to implement and test the grammer:

import util.parsing.combinator._
val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""

object TreeParser extends JavaTokenParsers {
  override val skipWhitespace = false
  def CRLF = "\r\n" | "\n"
  def BOF = "\\A".r
  def EOF = "\\Z".r
  def TXT = "[^\r\n]*".r
  def TXTNOSP = "[^ \r\n]*".r
  def SP = "\\s".r
  def level: Parser[Int] = "[0-3]{1}".r ^^ {v => v.toInt}
  def varName: Parser[String] = SP ~> TXTNOSP
  def varValue: Parser[String] = SP ~> TXT
  def recordBegin: Parser[Any] =  "0" ~ SP ~ varName ~ CRLF
  def recordItem: Parser[(Int,String,String)] = level ~ varValue ~ opt(varValue) <~ CRLF ^^
    {case l ~ f ~ v => (l,f,v.map(_+"").getOrElse(""))}
  def record: Parser[List[(Int,String,String)]] = recordBegin ~> rep(recordItem)
  def file: Parser[List[List[(Int,String,String)]]] = rep(record) <~ EOF
  def parse(input: String) = parseAll(file, input)
}

val result = TreeParser.parse(input).get
result.foreach(println)

Neigh answered 30/5, 2011 at 20:44 Comment(0)

V

3

As Daniel said, you should better let the parser handle whitespace skipping to minimize your code. However you may want to tweak the whitespace value so you can match end of lines explicitly. I did it below to prevent the parser from moving to the next line if no value for a record is defined.

As much as possible, try to use the parsers defined in JavaTokenParsers like ident if you want to match alphabetic words.

To ease your error tracing, perform a NoSuccess match on parseAll so you can see at what point the parser failed.

import util.parsing.combinator._

val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 var_without_value
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""

object TreeParser extends JavaTokenParsers {
  override val whiteSpace = """[ \t]+""".r

  val level = """[1-3]{1}""".r

  val value = """[a-zA-Z0-9_, ]*""".r
  val eol = """[\r?\n]+""".r

  def recordBegin = "0" ~ ident <~ eol

  def recordItem = level ~ ident ~ opt(value) <~ opt(eol) ^^ {
    case l ~ n ~ v => (l.toInt, n, v.getOrElse(""))
  }

  def record = recordBegin ~> rep1(recordItem)

  def file = rep1(record)

  def parse(input: String) = parseAll(file, input) match {
    case Success(result, _) => result
    case NoSuccess(msg, _) => throw new RuntimeException("Parsing Failed:" + msg)
  }
}

val result = TreeParser.parse(input)
result.foreach(println)

Vinny answered 31/5, 2011 at 19:2 Comment(2)

Why """ vs " for some regular expressions? – Neigh 1/6, 2011 at 0:40

Triple quote is usually used in regular expressions to treat the string as raw thus avoiding the need to escape backslashes. I believe it is easier to stick to triple quote for all regular expressions even when not needed -see post edit. – Vinny 1/6, 2011 at 14:29

S

3

Handling whitespace explicitly is not a particularly good idea. And, of course, using get means you lose the error message. In this particular example:

[1.3] failure: string matching regex `\s' expected but `f' found

0 fruit

  ^

Which is actually pretty clear, though the question is why it expected a space. Now, this was obviously processing a recordBegin rule, which is defined thusly:

"0" ~ SP ~ varName ~ CRLF

So, it parsers the zero, then the space, and then fruit must be parsed against varName. Now, varName is defined like this:

SP ~> TXTNOSP

Another space! So, fruit should have began with a space.

Sulky answered 31/5, 2011 at 17:50 Comment(2)

Daniel, thanks for the feedback."Handling whitespace explicitly is not a particularly good idea." How then do I terminate the parsing of a recordItem since varValue is optional and may be a string with spaces? – Neigh 31/5, 2011 at 19:5

@Neigh You can override whiteSpace to exclude new lines, and then match on new lines explicitly. By the way, your SP will match newlines, as you defined it. – Sulky 31/5, 2011 at 19:35

V

3