Scala parser combinator, large file issue
Asked Answered
G

1

13

I have written a parser as follows:

class LogParser extends JavaTokenParsers {

  def invertedIndex: Parser[Array[Array[(Int, Int)]]] = {
    num ~> num ~> num ~> rep(postingsList) ^^ {
      _.toArray
    }
  }

  def postingsList: Parser[Array[(Int, Int)]] = {
    num ~> rep(entry) ^^ {
      _.toArray
    }
  }

  def entry = {
    num ~ "," ~ num ^^ {
      case docID ~ "," ~ count => (docID.toInt, count.toInt)
    }
  }

  def num = wholeNumber ^^ (_.toInt)

}

If I parse from a (270MB) file with a FileReader as follows:

val index = parseAll(invertedIndex, new FileReader("path/to/file")).get

I get an Exception in thread "main" java.lang.StackOverflowError (I have also tried wrapping in a BufferedReader) but I can fix it by first reading the file into a String like so:

val input = io.Source.fromFile("path/to/file")
val str = input.mkString
input.close()
val index = parseAll(invertedIndex, str).get

Why is this the case? Is there any way to avoid reading it as a String first, it seems a waste?

Grados answered 3/11, 2012 at 21:24 Comment(5)
What's the current size of your stack, and how much larger do you have to make your stack to avoid the StackOverflowException? How much smaller must the stack be to make the String version overflow? (You can set your stack to 16MB by launching like so: scala -J-Xss16M)Logway
I was just using the default stack size, but when I set it to 16M the program was still running 30mins later...Grados
This could be related to the Scala 2.9.2 bug SI-6520.Impalpable
Looking at the bug @DonRoby liked to and thinking about how parser combinators work, I think the backtracking mechanism would require the parser to have the entire file in memory to enable backtracking anyway. This does sound like a bug in the PagedSeq bug, so you're probably best off just reading it into a string.Logway
I've run into this problem before, and have also used the string workaround. I've also just created a new issue (SI-6612; with an example that isn't parser combinator-specific), since this seems fairly distinct from SI-6520—the problem is blowing the stack, not running out of memory.Rollick
J
1

There is another library[1] that is a lot like the scala parser combinators that supports Trampolining which is what you need to stop the stackoverflow errors.

[1] https://github.com/djspiewak/gll-combinators

Jerz answered 16/11, 2012 at 2:14 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.