non-greedy matching in Scala RegexParsers
Asked Answered
V

1

13

Suppose I'm writing a rudimentary SQL parser in Scala. I have the following:

class Arith extends RegexParsers {
    def selectstatement: Parser[Any] = selectclause ~ fromclause
    def selectclause: Parser[Any] = "(?i)SELECT".r ~ tokens
    def fromclause: Parser[Any] = "(?i)FROM".r ~ tokens
    def tokens: Parser[Any] = rep(token) //how to make this non-greedy?
    def token: Parser[Any] = "(\\s*)\\w+(\\s*)".r
}

When trying to match selectstatement against SELECT foo FROM bar, how do I prevent the selectclause from gobbling up the entire phrase due to the rep(token) in ~ tokens?

In other words, how do I specify non-greedy matching in Scala?

To clarify, I'm fully aware that I can use standard non-greedy syntax (*?) or (+?) within the String pattern itself, but I wondered if there's a way to specify it at a higher level inside def tokens. For example, if I had defined token like this:

def token: Parser[Any] = stringliteral | numericliteral | columnname

Then how can I specify non-greedy matching for the rep(token) inside def tokens?

Vinavinaceous answered 18/10, 2011 at 19:33 Comment(1)
It seems like we are dealing with PEG feature here: Whereas regular expression matchers may start by matching greedily, but will then backtrack and try shorter matches if they fail and CFG tries every possibility, PEG's *, +, and ? operators always behave greedily, consuming as much input as possible and never backtracking: Expression a* will always consume as many a's as are consecutively available in the input string, causing (a* a) to fail always.Hone
S
12

Not easily, because a successful match is not retried. Consider, for example:

object X extends RegexParsers {
  def p = ("a" | "aa" | "aaa" | "aaaa") ~ "ab"
}

scala> X.parseAll(X.p, "aaaab")
res1: X.ParseResult[X.~[String,String]] = 
[1.2] failure: `ab' expected but `a' found

aaaab
 ^

The first match was successful, in parser inside parenthesis, so it proceeded to the next one. That one failed, so p failed. If p was part of alternative matches, the alternative would be tried, so the trick is to produce something that can handle that sort of thing.

Let's say we have this:

def nonGreedy[T](rep: => Parser[T], terminal: => Parser[T]) = Parser { in =>
  def recurse(in: Input, elems: List[T]): ParseResult[List[T] ~ T] =
    terminal(in) match {
      case Success(x, rest) => Success(new ~(elems.reverse, x), rest)
      case _ => 
        rep(in) match {
          case Success(x, rest) => recurse(rest, x :: elems)
          case ns: NoSuccess    => ns
        }
    }

  recurse(in, Nil)
}  

You can then use it like this:

def p = nonGreedy("a", "ab")

By the way,I always found that looking at how other things are defined is helpful in trying to come up with stuff like nonGreedy above. In particular, look at how rep1 is defined, and how it was changed to avoid re-evaluating its repetition parameter -- the same thing would probably be useful on nonGreedy.

Here's a full solution, with a little change to avoid consuming the "terminal".

trait NonGreedy extends Parsers {
    def nonGreedy[T, U](rep: => Parser[T], terminal: => Parser[U]) = Parser { in =>
      def recurse(in: Input, elems: List[T]): ParseResult[List[T]] =
        terminal(in) match {
          case _: Success[_] => Success(elems.reverse, in)
          case _ => 
            rep(in) match {
              case Success(x, rest) => recurse(rest, x :: elems)
              case ns: NoSuccess    => ns
            }
        }

      recurse(in, Nil)
    }  
}

class Arith extends RegexParsers with NonGreedy {
    // Just to avoid recompiling the pattern each time
    val select: Parser[String] = "(?i)SELECT".r
    val from: Parser[String] = "(?i)FROM".r
    val token: Parser[String] = "(\\s*)\\w+(\\s*)".r
    val eof: Parser[String] = """\z""".r

    def selectstatement: Parser[Any] = selectclause(from) ~ fromclause(eof)
    def selectclause(terminal: Parser[Any]): Parser[Any] = 
      select ~ tokens(terminal)
    def fromclause(terminal: Parser[Any]): Parser[Any] = 
      from ~ tokens(terminal)
    def tokens(terminal: Parser[Any]): Parser[Any] = 
      nonGreedy(token, terminal)
}
Sidereal answered 18/10, 2011 at 22:26 Comment(11)
I noticed that changing to def p = ("a" ||| "aa" ||| "aaa") ~ "ab" does parse in your example, but I'm not clear on what the ||| operator is really doing or whether it has any bearing on the original questionVinavinaceous
@Vinavinaceous ||| will just select the largest match, which happens to be the correct one. Add one ||| "aaaa" and you'll see it fail.Sidereal
Thanks for this def nonGreedy solution. I don't understand how to apply it... nonGreedy takes two arguments, but the rep(token) that I'm trying to make non-greedy takes one arg. What should the args be in my particular case?Vinavinaceous
Wow, it works! Impressive solution... not intuitive at all to me. Thanks againVinavinaceous
Four years later this is still a useful answer.Holmun
I am trying to use this but uncertain what is intended by the "terminal". A blurb in your answer would be appreciated. Is it a semicolon (terminating a sql statement) ?Holmun
@javadba The terminal is what terminates that parser. If you look at the last example, a "select" is terminated once "from" is found, and "from" is terminated once the input is finished.Sidereal
Wouldn't parser generator be more advantageous is such cases? I see that parser combinators often end up with stack overflow whereas parser generators will tell you about the left recursion at compile time.Koehn
@ValentinTihomirov You mean something like YACC? I think they are clumsy and I'd take a good parser DSL any day. That said, it's been a while since I used Scala's built-in parser; there are much better alternatives available nowadays.Sidereal
Why do you say that DSL parsers are clumsy and you would use them instead? It seems like a contradictory statement to me. Which alternatives are you talking about? FastParse is a replacement but authors shay that it outperform only by speed. JavaCC is said to produce a pretty decent parser. The only problem is that it is Java, which means undesirable extra compilation step/tool.Koehn
@ValentinTihomirov YACC is a language (and a tool), which is compiled into C code that parses the grammar it describes. Scala Parsers is a DSL in Scala used to describe grammars for parsing. The first requires an extra compilation step, and the source code it produces cannot be modified, since it's, of course, generated. For that reason I prefer the second, DSLs directly encoded in a target language. YACC, ANTLR = bad. Scala Parsers, Parboiled2 = good. And Parboiled2 is what I have used lately.Sidereal

© 2022 - 2024 — McMap. All rights reserved.