Indentation sensitive parser using Parslet in Ruby?
Asked Answered
C

2

10

I am attempting to parse a simple indentation sensitive syntax using the Parslet library within Ruby.

The following is an example of the syntax I am attempting to parse:

level0child0
level0child1
  level1child0
  level1child1
    level2child0
  level1child2

The resulting tree would look like so:

[
  {
    :identifier => "level0child0",
    :children => []
  },
  {
    :identifier => "level0child1",
    :children => [
      {
        :identifier => "level1child0",
        :children => []
      },
      {
        :identifier => "level1child1",
        :children => [
          {
            :identifier => "level2child0",
            :children => []
          }
        ]
      },
      {
        :identifier => "level1child2",
        :children => []
      },
    ]
  }
]

The parser that I have now can parse nesting level 0 and 1 nodes, but cannot parse past that:

require 'parslet'

class IndentationSensitiveParser < Parslet::Parser

  rule(:indent) { str('  ') }
  rule(:newline) { str("\n") }
  rule(:identifier) { match['A-Za-z0-9'].repeat.as(:identifier) }

  rule(:node) { identifier >> newline >> (indent >> identifier >> newline.maybe).repeat.as(:children) }

  rule(:document) { node.repeat }

  root :document

end

require 'ap'
require 'pp'

begin
  input = DATA.read

  puts '', '----- input ----------------------------------------------------------------------', ''
  ap input

  tree = IndentationSensitiveParser.new.parse(input)

  puts '', '----- tree -----------------------------------------------------------------------', ''
  ap tree

rescue IndentationSensitiveParser::ParseFailed => failure
  puts '', '----- error ----------------------------------------------------------------------', ''
  puts failure.cause.ascii_tree
end

__END__
user
  name
  age
recipe
  name
foo
bar

It's clear that I need a dynamic counter that expects 3 indentation nodes to match a identifier on the nesting level 3.

How can I implement an indentation sensitive syntax parser using Parslet in this way? Is it possible?

Custer answered 12/5, 2013 at 6:56 Comment(1)
Not sure if this is better done as parse/build separate stages. Pretty much any combination of indentation levels would be valid and parse, so to me this points to a very simple line-based parser that just captures indentation level, then something that takes the parser output and builds your nested structure.Noticeable
J
14

There are a few approaches.

  1. Parse the document by recognising each line as a collection of indents and an identifier, then apply a transformation afterwards to reconstruct the hierarchy based on the number of indents.

  2. Use captures to store the current indent and expect the next node to include that indent plus more to match as a child (I didn't dig into this approach much as the next one occurred to me)

  3. Rules are just methods. So you can define 'node' as a method, which means you can pass parameters! (as follows)

This lets you define node(depth) in terms of node(depth+1). The problem with this approach, however, is that the node method doesn't match a string, it generates a parser. So a recursive call will never finish.

This is why dynamic exists. It returns a parser that isn't resolved until the point it tries to match it, allowing you to now recurse without problems.

See the following code:

require 'parslet'

class IndentationSensitiveParser < Parslet::Parser

  def indent(depth)
    str('  '*depth)
  end

  rule(:newline) { str("\n") }

  rule(:identifier) { match['A-Za-z0-9'].repeat(1).as(:identifier) }

  def node(depth) 
    indent(depth) >> 
    identifier >> 
    newline.maybe >> 
    (dynamic{|s,c| node(depth+1).repeat(0)}).as(:children)
  end 

  rule(:document) { node(0).repeat }

  root :document
end

This is my favoured solution.

Jorgan answered 15/5, 2013 at 10:54 Comment(2)
You, sir, have blown my mind. I have been attempting approach number 2 for some time now. My problem was that I didn't have a clear understanding of dynamic as the docs don't dive into the subject as much as I'd like. Thank you so much for the answer! This parser will likely provide the groundwork for many others in my future :pCuster
Thanks a lot for this, I've used this solution in github.com/alphagov/smartdownOutsail
O
0

I don't like the idea of weaving knowledge of the indentation process through the whole grammar. I would rather just have INDENT and DEDENT tokens produced that other rules could use similarly to just matching "{" and "}" characters. So the following is my solution. It is a class IndentParser that any parser can extend to get nl, indent, and decent tokens generated.

require 'parslet'

# Atoms returned from a dynamic that aren't meant to match anything.
class AlwaysMatch < Parslet::Atoms::Base
  def try(source, context, consume_all)
    succ("")
  end
end
class NeverMatch < Parslet::Atoms::Base
  attr_accessor :msg
  def initialize(msg = "ignore")
    self.msg = msg
  end
  def try(source, context, consume_all)
    context.err(self, source, msg)
  end
end
class ErrorMatch < Parslet::Atoms::Base
  attr_accessor :msg
  def initialize(msg)
    self.msg = msg
  end
  def try(source, context, consume_all)
    context.err(self, source, msg)
  end
end

class IndentParser < Parslet::Parser

  ##
  # Indentation handling: when matching a newline we check the following indentation. If
  # that indicates an indent token or detent tokens (1+) then we stick these in a class
  # variable and the high-priority indent/dedent rules will match as long as these 
  # remain. The nl rule consumes the indentation itself.

  rule(:indent)  { dynamic {|s,c| 
    if @indent.nil?
      NeverMatch.new("Not an indent")
    else
      @indent = nil
      AlwaysMatch.new
    end
  }}
  rule(:dedent)  { dynamic {|s,c|
    if @dedents.nil? or @dedents.length == 0
      NeverMatch.new("Not a dedent")
    else
      @dedents.pop
      AlwaysMatch.new
    end
  }}

  def checkIndentation(source, ctx)
    # See if next line starts with indentation. If so, consume it and then process
    # whether it is an indent or some number of dedents.
    indent = ""
    while source.matches?(Regexp.new("[ \t]"))
      indent += source.consume(1).to_s #returns a Slice
    end

    if @indentStack.nil?
      @indentStack = [""]
    end

    currentInd = @indentStack[-1]
    return AlwaysMatch.new if currentInd == indent #no change, just match nl

    if indent.start_with?(currentInd)
      # Getting deeper
      @indentStack << indent
      @indent = indent #tells the indent rule to match one
      return AlwaysMatch.new
    else
      # Either some number of de-dents or an error

      # Find first match starting from back
      count = 0
      @indentStack.reverse.each do |level|
        break if indent == level #found it, 

        if level.start_with?(indent)
          # New indent is prefix, so we de-dented this level.
          count += 1
          next
        end

        # Not a match, not a valid prefix. So an error!
        return ErrorMatch.new("Mismatched indentation level")
      end

      @dedents = [] if @dedents.nil?
      count.times { @dedents << @indentStack.pop }
      return AlwaysMatch.new
    end
  end
  rule(:nl)         { anynl >> dynamic {|source, ctx| checkIndentation(source,ctx) }}

  rule(:unixnl)     { str("\n") }
  rule(:macnl)      { str("\r") }
  rule(:winnl)      { str("\r\n") }
  rule(:anynl)      { unixnl | macnl | winnl }

end

I'm sure a lot can be improved, but this is what I've come up with so far.

Example usage:

class MyParser < IndentParser
  rule(:colon)      { str(':') >> space? }

  rule(:space)      { match(' \t').repeat(1) }
  rule(:space?)     { space.maybe }

  rule(:number)     { match['0-9'].repeat(1).as(:num) >> space? }
  rule(:identifier) { match['a-zA-Z'] >> match["a-zA-Z0-9"].repeat(0) }

  rule(:block)      { colon >> nl >> indent >> stmt.repeat.as(:stmts) >> dedent }
  rule(:stmt)       { identifier.as(:id) >> nl | number.as(:num) >> nl | testblock }
  rule(:testblock)  { identifier.as(:name) >> block }

  rule(:prgm)       { testblock >> nl.repeat }
  root :prgm
end
Original answered 26/4, 2014 at 1:37 Comment(1)
I found a problem with my answer. Since PEGs greedily consume the next match without considering non-deterministic alternatives, nothing checks the dedent/indent rules when adding statements inside a block. So if a child-block is followed by a dedent and more statements those stmts get greedily added to the child block ignoring the dedent. To fix, either weave same-dent rules thru grammar (yuck) or would need to insert tokens in the input stream like a lexer can.Original

© 2022 - 2024 — McMap. All rights reserved.