Ruby parslet: parsing multiple lines

Asked 18/7, 2013 at 17:26 Answered 24/7, 2013 at 0:11

I'm looking for a way to match multiple lines with Parslet. The code looks like this:

rule(:line) { (match('$').absent? >> any).repeat >> match('$') }
rule(:lines) { line.repeat }

However, lines always ends up in an infinite loop, because match('$') keeps matching the end of the string over and over.

Is it possible to match multiple lines that can be empty?

irb(main)> lines.parse($stdin.read)
This
is

a
multiline

string^D

should match successfully. Am I missing something? I also tried (match('$').absent? >> any.maybe).repeat(1) >> match('$') but that doesn't match empty lines.

Regards,
Danyel.

Parthenope answered 18/7, 2013 at 17:26 Comment(0)

I think you have two related problems with your matching:

The pseudo-character match $ does not consume any real characters. You still need to consume the newlines somehow.
Parslet is munging the input in some way, making $ match in places you might not expect. The best result I could get using $ ended up matching each individual character.
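
To illustrate the first point outside Parslet: in a plain Ruby regex, $ is a zero-width anchor, so matching it moves nothing forward. This is a minimal sketch using StringScanner from the standard library, not Parslet's own machinery, and the variable names are only for illustration:

```ruby
require 'strscan'

scanner = StringScanner.new("ab\ncd")
scanner.scan(/ab/)                 # consume "ab"; the scanner now sits just before "\n"

pos_before = scanner.pos
matched    = scanner.scan(/$/)     # $ matches here, but only as an empty string
pos_after  = scanner.pos

p matched                          # ""
p pos_after == pos_before          # true: nothing was consumed

newline = scanner.scan(/\n/)       # the newline still has to be consumed explicitly
```

The same thing applies to any parser that loops on such a match: the position never advances unless something else eats the newline.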

Much safer to use \n as the end-of-line character. I got the following to work (I am somewhat of a beginner with Parslet myself, so apologies if it could be clearer):

require 'parslet'

class Lines < Parslet::Parser
    rule(:text) { match("[^\n]") }
    rule(:line) { ( text.repeat(0) >> match("\n") ) | text.repeat(1) }
    rule(:lines) { line.as(:line).repeat }
    root :lines
end

s = "This
is

a
multiline
string"

p Lines.new.parse( s )

The rule for the line is complex because of the need to match empty lines and a possible final line without a \n.

You don't have to use the .as(:line) syntax - I just added it to show clearly that the :line rule is matching each line individually, and not simply consuming the whole input.
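
For comparison, the same two-way alternation can be written as a plain Ruby regex. This is only an analogy to the :line rule above, not Parslet code, and the names are made up for the sketch:

```ruby
# Either any run of non-newline characters followed by "\n" (this covers empty lines),
# or a non-empty final line that has no trailing "\n".
line_pattern = /[^\n]*\n|[^\n]+/

text  = "This\nis\n\na\nmultiline\nstring"
lines = text.scan(line_pattern)

p lines
# => ["This\n", "is\n", "\n", "a\n", "multiline\n", "string"]
```

The second branch requires at least one character, which is what stops the pattern from matching an empty string at the very end of the input.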

Exceptional answered 18/7, 2013 at 20:19 Comment(1)

This looks like a nice solution. My workaround was to work with \n, too and to add a newline to the incoming string in order to prevent match failure at the end. This looks cleaner, though. Thanks! – Parthenope 18/7, 2013 at 22:27

I usually define a rule for end_of_line. This is based on the trick in http://kschiess.github.io/parslet/tricks.html for matching end_of_file.

class MyParser < Parslet::Parser
  rule(:cr)         { str("\n") }
  rule(:eol?)       { any.absent? | cr }
  rule(:line_body)  { (eol?.absent? >> any).repeat(1) }
  rule(:line)       { cr | line_body >> eol? }
  rule(:lines?)     { line.repeat(0) }
  root(:lines?)
end

puts MyParser.new.parse(" this is a line
so is this

that was too
This ends").inspect

Obviously, if you want to do more with the parser than you can achieve with String#split("\n"), you will replace line_body with something useful :)

I had a quick go at answering this question and mucked it up. I just thought I would explain the mistake I made, and show you how to avoid mistakes of that kind.

Here is my first answer.

rule(:eol)   { str("\n") | any.absent? }
rule(:line)  { (eol.absent? >> any).repeat >> eol }
rule(:lines) { line.as(:line).repeat }

I didn't follow my usual rules:

Always make repeat count explicit
Any rule that can match a zero-length string should have a name ending in '?'

So let's apply these...

rule(:eol?)   { str("\n") | any.absent? }
# as the second option consumes nothing

rule(:line?)  { (eol?.absent? >> any).repeat(0) >> eol? }
# repeat(0) can consume nothing

rule(:lines?) { line?.as(:line).repeat(0) }
# We have a problem! We have a rule that can consume nothing inside a `repeat`!

Here we see why we get an infinite loop. As the input is consumed, you end up with just the end of file, which matches eol? and hence line? (as the line body can be empty). Because line? sits inside the repeat in lines?, it keeps matching without consuming anything and loops forever.

We need to change the line rule so it always consumes something.

rule(:cr)         { str("\n") }
rule(:eol?)       { cr | any.absent? }
rule(:line_body)  { (eol?.absent? >> any).repeat(1) }
rule(:line)       { cr | line_body >> eol? }
rule(:lines?)     { line.as(:line).repeat(0) }

Now line has to match something, either a cr (for empty lines), or at least one character followed by the optional eol?. All repeats have bodies that consume something. We are now golden.
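
The difference between the two versions can be seen without Parslet. The sketch below is an analogy only: repeat_match is a hypothetical helper built on StringScanner, and the two regexes are rough stand-ins for line? and line. It makes no claim about how Parslet implements repeat internally.

```ruby
require 'strscan'

# Hypothetical helper, not part of Parslet: applies `pattern` over and over
# from the current position, the way a greedy repeat would, and reports
# whether it had to bail out because a match consumed no input.
def repeat_match(input, pattern)
  scanner = StringScanner.new(input)
  matches = []
  loop do
    start = scanner.pos
    text  = scanner.scan(pattern)
    # Pattern failed: the normal way for a repeat to end.
    return { matches: matches, stalled: false } if text.nil?
    # Zero-width match: without this guard the loop would spin forever.
    return { matches: matches, stalled: true } if scanner.pos == start
    matches << text
  end
end

# Rough analogue of line?: an optional body, then a newline or end of input.
can_be_empty = /[^\n]*(?:\n|\z)/
# Rough analogue of line: a bare newline, or a non-empty body followed by eol?.
always_eats  = /\n|[^\n]+(?:\n|\z)/

loose  = repeat_match("a\n\nb", can_be_empty)
strict = repeat_match("a\n\nb", always_eats)

p loose[:stalled]    # true
p strict[:stalled]   # false
p strict[:matches]   # ["a\n", "\n", "b"]
```

Both patterns split the input into the same lines. The only difference is what happens at end of input, where the first one keeps succeeding on an empty string and the second one fails cleanly.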

Jessen answered 24/7, 2013 at 0:11 Comment(3)

This turns into an infinite loop for me. – Parthenope 26/7, 2013 at 12:7

oops. yes I'll fix that. – Jessen 28/7, 2013 at 8:57

Infinite loops happen when you have rules that can match without consuming any input. Here line matches an empty line, followed by the any.absent? version of eol which also doesn't consume anything, so it can keep matching. – Jessen 28/7, 2013 at 9:6

