Accessing Scala Parser regular expression match data

Asked 29/11, 2009 at 14:49 Answered 6/3, 2011 at 6:32

I'm wondering if it's possible to get the MatchData generated by the regular expression match in the grammar below.

object DateParser extends JavaTokenParsers {

    ....

    val dateLiteral = """(\d{4}[-/])?(\d\d[-/])?(\d\d)""".r ^^ {
        ... get MatchData
    }
}

One option, of course, is to perform the match again inside the block, but since RegexParsers has already performed the match, I'm hoping that it passes the MatchData to the block or stores it somewhere.

Parapsychology answered 29/11, 2009 at 14:49 Comment(0)

Here is the implicit definition that converts your Regex into a Parser:

  /** A parser that matches a regex string */
  implicit def regex(r: Regex): Parser[String] = new Parser[String] {
    def apply(in: Input) = {
      val source = in.source
      val offset = in.offset
      val start = handleWhiteSpace(source, offset)
      (r findPrefixMatchOf (source.subSequence(start, source.length))) match {
        case Some(matched) =>
          Success(source.subSequence(start, start + matched.end).toString, 
                  in.drop(start + matched.end - offset))
        case None =>
          Failure("string matching regex `"+r+"' expected but `"+in.first+"' found", in.drop(start - offset))
      }
    }
  }

Just adapt it:

import scala.util.matching.Regex
import scala.util.parsing.combinator.RegexParsers

object X extends RegexParsers {
  /** A parser that matches a regex string and returns the Match */
  def regexMatch(r: Regex): Parser[Regex.Match] = new Parser[Regex.Match] {
    def apply(in: Input) = {
      val source = in.source
      val offset = in.offset
      val start = handleWhiteSpace(source, offset)
      (r findPrefixMatchOf (source.subSequence(start, source.length))) match {
        case Some(matched) =>
          Success(matched,
                  in.drop(start + matched.end - offset))
        case None =>
          Failure("string matching regex `"+r+"' expected but `"+in.first+"' found", in.drop(start - offset))
      }
    }
  }
  val t = regexMatch("""(\d\d)/(\d\d)/(\d\d\d\d)""".r) ^^ { case m => (m.group(1), m.group(2), m.group(3)) }
}

Example:

scala> X.parseAll(X.t, "23/03/1971")
res8: X.ParseResult[(String, String, String)] = [1.11] parsed: (23,03,1971)

Candiot answered 29/11, 2009 at 17:6 Comment(3)

It's odd that this kind of functionality isn't part of the standard library's implementation. It looks pretty useful, but every user has to implement it themselves... – Koblick 21/3, 2014 at 4:16

@DmitryBespalov One can simply apply the pattern again to extract groups, and I'd rather use a grammar than more complex regex rules. So, yes, it might be useful but it's not necessary, and there are serious drawbacks in a bloated library. – Candiot 21/3, 2014 at 21:16

It is known that lexing is more efficient with regexes than with conventional parsing. Meanwhile, mixing Lexer with RegexParsers gives the error "object Lexer inherits conflicting member type Elem in trait Scanners and RegexParsers", and I cannot avoid extending RegexParsers because it defines the handleWhiteSpace function. – Stereogram 4/1, 2016 at 11:26

No, you can't do this. If you look at the definition of the Parser used when you convert a regex to a Parser, it throws away all context and just returns the full matched string:

http://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_7_7_final/src/library/scala/util/parsing/combinator/RegexParsers.scala?view=markup#L55

You have a couple of other options, though:

1. Break up your parser into several smaller parsers (for the tokens you actually want to extract).
2. Define a custom parser that extracts the values you want and returns a domain object instead of a string.

The first would look like this:

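  // Inside an object that extends RegexParsers
  val separator = "-" | "/"
  // .r and .? are written with a dot so no postfix-operator import is needed
  val year = """\d{4}""".r <~ separator
  val month = """\d\d""".r <~ separator
  val day = """\d\d""".r

  val date = ((year.?) ~ (month.?) ~ day) map {
    case year ~ month ~ day =>
      // fall back to default values when year or month is absent
      (year.getOrElse("2009"), month.getOrElse("11"), day)
  }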

The <~ means "require these two tokens together, but only give me the result of the first one."

The ~ means "require these two tokens together and tie them together in a pattern-matchable ~ object."

The ? means that the parser is optional and will return an Option.

The .getOrElse bit provides a default value for when the optional parser matched nothing and returned None.
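To make the four pieces above concrete, here is a minimal sketch of each one in isolation. It assumes the scala-parser-combinators module is on the classpath (it is a separate dependency from Scala 2.11 onwards); CombinatorDemo and the parser names in it are made up for illustration.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical demo object; names are for illustration only.
object CombinatorDemo extends RegexParsers {
  def digits: Parser[String] = """\d+""".r
  def minus: Parser[String]  = "-"

  // <~ : both parsers must match, only the left result is kept
  def keepLeft: Parser[String] = digits <~ minus

  // ~ : both parsers must match, results are paired in a ~ value
  def both: Parser[String ~ String] = digits ~ minus

  // ? : never fails, yields Some(result) or None (same as opt(minus))
  def optional: Parser[Option[String]] = minus.?

  // getOrElse supplies a default when the optional part is absent
  def signed: Parser[String] =
    opt(minus) ~ digits ^^ { case sign ~ ds => sign.getOrElse("+") + ds }
}
```

For instance, CombinatorDemo.parseAll(CombinatorDemo.signed, "7").get should give "+7", while the same parser on "-7" gives "-7".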

Tufts answered 29/11, 2009 at 16:1 Comment(2)

Thanks David, nice solution. I'm going to go with the custom parser solution as it keeps the grammar definition more readable. – Parapsychology 29/11, 2009 at 21:18

Now that I think of it, a custom parser is also more correct. Each individual regex parser allows leading whitespace, so the code I posted would also match strings like "1999 - 02 - 28". – Tufts 29/11, 2009 at 21:50
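The whitespace point can be checked with a small sketch. It assumes scala-parser-combinators is on the classpath; DateGrammar and WhitespaceDemo are hypothetical names, and the grammar is the split-up one from the answer with whitespace skipping made switchable through skipWhitespace.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical helper class: the split-up date grammar with
// whitespace skipping turned on or off by a constructor flag.
class DateGrammar(skip: Boolean) extends RegexParsers {
  override def skipWhitespace: Boolean = skip

  def separator: Parser[String] = "-" | "/"
  def year: Parser[String]      = """\d{4}""".r <~ separator
  def month: Parser[String]     = """\d\d""".r <~ separator
  def day: Parser[String]       = """\d\d""".r

  def date: Parser[(Option[String], Option[String], String)] =
    opt(year) ~ opt(month) ~ day ^^ { case y ~ m ~ d => (y, m, d) }
}

object WhitespaceDemo {
  def main(args: Array[String]): Unit = {
    val lenient = new DateGrammar(skip = true)  // default RegexParsers behaviour
    val strict  = new DateGrammar(skip = false)

    // Every literal and regex parser skips leading whitespace, so this is accepted.
    println(lenient.parseAll(lenient.date, "1999 - 02 - 28").successful) // true
    // Without skipping, the spaced-out form is rejected...
    println(strict.parseAll(strict.date, "1999 - 02 - 28").successful)   // false
    // ...while the compact form still parses.
    println(strict.parseAll(strict.date, "1999-02-28").successful)       // true
  }
}
```

Overriding skipWhitespace is only one way to tighten the grammar; a single custom parser over the whole date, as suggested above, avoids the issue without changing whitespace handling for the rest of the grammar.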

When a Regex is used in a RegexParsers instance, the implicit def regex(Regex): Parser[String] in RegexParsers is used to apply that Regex to the input. The Match instance yielded upon successful application of the RE at the current input is used to construct a Success in the regex() method, but only its "end" value is used, so any captured sub-matches are discarded by the time that method returns.

As it stands (in the 2.7 source I looked at), you're out of luck, I believe.
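To illustrate what is kept and what is lost, here is a small sketch that uses only scala.util.matching.Regex, with no parser combinators involved. PrefixMatchDemo and its method names are made up for the example; the pattern is the one from the question. The second method shows the "match again" workaround: re-applying the pattern to the string the parser hands you.

```scala
import scala.util.matching.Regex

// Hypothetical demo object; only standard-library Regex calls are used.
object PrefixMatchDemo {
  val datePattern: Regex = """(\d{4}[-/])?(\d\d[-/])?(\d\d)""".r

  // Roughly what regex() keeps: the matched text, cut off at the match's end.
  def matchedText(input: CharSequence): Option[String] =
    datePattern.findPrefixMatchOf(input).map(m => input.subSequence(0, m.end).toString)

  // Re-applying the pattern to the matched text recovers the groups.
  // Optional groups that did not take part in the match come back as null,
  // so they are wrapped in Option here.
  def groups(text: String): Option[(Option[String], Option[String], String)] =
    datePattern.findPrefixMatchOf(text).map { m =>
      (Option(m.group(1)), Option(m.group(2)), m.group(3))
    }
}
```

For example, PrefixMatchDemo.groups("2009-11-29") yields Some((Some("2009-"), Some("11-"), "29")), and PrefixMatchDemo.groups("29") yields Some((None, None, "29")). Note that the first two groups include the separator, because the question's pattern captures it.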

Decern answered 29/11, 2009 at 15:48 Comment(0)

I ran into a similar issue using Scala 2.8.1 while trying to parse input of the form "name:value" with the RegexParsers class:

package scalucene.query

import scala.util.matching.Regex
import scala.util.parsing.combinator._

object QueryParser extends RegexParsers {
  override def skipWhitespace = false

  private def quoted = regex(new Regex("\"[^\"]+"))
  private def colon = regex(new Regex(":"))
  private def word = regex(new Regex("\\w+"))
  private def fielded = (regex(new Regex("[^:]+")) <~ colon) ~ word
  private def term = (fielded | word | quoted)

  def parseItem(str: String) = parse(term, str)
}

It seems that you can grab the individual matched values after parsing like this:

QueryParser.parseItem("nameExample:valueExample") match {
  // fielded yields a ~ pair, so destructure it instead of using productElement
  case QueryParser.Success(QueryParser.~(name, value), _) =>
    println("Name: " + name + " value: " + value)
  // a plain word, a quoted string, or a parse failure ends up here
  case other =>
    println("No name:value pair: " + other)
}

Benne answered 6/3, 2011 at 6:32 Comment(0)
