Simple parsing questions using PEG.js

Asked 31/8, 2010 at 20:12 Answered 26/1, 2017 at 21:28

I'm trying to wrap my head around PEG by entering simple grammars into the PEG.js playground.

Example 1:

Input: "abcdef1234567ghijklmn8901opqrs"
Desired output: ["abcdef", "1234567", "ghijklmn", "8901", "opqrs"]
Actual output: ["abcdef", ["1234567", ["ghijklmn", ["8901", ["opqrs", ""]]]]]

This example pretty much works, but can I get PEG.js to not nest the resulting array to a million levels? I assume the trick is to use concat() instead of join() somewhere, but I can't find the spot.

start
  = Text

Text
  = Numbers Text
  / Characters Text
  / EOF

Numbers
  = numbers: [0-9]+ {return numbers.join("")}

Characters
  = text: [a-z]+ {return text.join("")}

EOF
  = !.

Example 2:

Same problem and code as Example 1, but change the Characters rule to the following, which I expected would produce the same result.

Characters
  = text: (!Numbers .)+ {return text.join("")}

The resulting output is:

[",a,b,c,d,e,f", ["1234567", [",g,h,i,j,k,l,m,n", ["8901", [",o,p,q,r,s", ""]]]]]

Why do I get all these empty matches?

Example 3:

Last question. This doesn't work at all. How can I make it work? And for bonus points, any pointers on efficiency? For example, should I avoid recursion if possible?

I'd also appreciate a link to a good PEG tutorial. I've read (http://www.codeproject.com/KB/recipes/grammar_support_1.aspx), but as you can see I need more help ...

Input: 'abcdefghijklmnop"qrstuvwxyz"abcdefg'
Desired output: ["abcdefghijklmnop", "qrstuvwxyz", "abcdefg"]
Actual output: "abcdefghijklmnop\"qrstuvwxyz\"abcdefg"

start
  = Words

Words
  = Quote
  / Text
  / EOF

Quote
  = quote: ('"' .* '"') Words {return quote.join("")}

Text
  = text: (!Quote . Words) {return text.join("")}

EOF
  = !.

Hawfinch answered 31/8, 2010 at 20:12 Comment(1)

FYI: This question refers to a very old version of pegjs. Newer versions have an entirely different syntax supporting lots of useful features like text() which references the complete text string that as been matched, as in: digits = [0-9]+ { return text() } – Albertson 26/1, 2017 at 20:53

I received a reply in the PEG.js Google Group that helped me onto the right track. I'm posting answers to all three problems in the hope that they can serve as a rudimentary tutorial for other PEG beginners like myself. Notice that no recursion is needed.

Example 1:

This is straightforward once you understand basic PEG idioms.

start
  = Text+

Text
  = Numbers
  / Characters

Numbers
  = numbers: [0-9]+ {return numbers.join("")}

Characters
  = text: [a-z]+ {return text.join("")}

Example 2:

The problem here is a peculiar design choice in the PEG.js parser generator for Peek expressions (&expr and !expr). Both peek ahead into the input stream without consuming any characters, so I incorrectly assumed that they didn't return anything. However, they both return an empty string. I hope the author of PEG.js changes this behavior, because (as far as I can tell) this is just unnecessary cruft that pollutes the output stream. Please correct me if I'm wrong about this!

Anyway, here is a workaround:

start
  = Text+

Text
  = Numbers
  / Words

Numbers
  = numbers: [0-9]+ {return numbers.join("")}

Words
  = text: Letter+ {return text.join("")}

Letter
  = !Numbers text: . {return text}

Example 3:

The problem is that an expression like ('"' .* '"') can never succeed. PEG is always greedy, so .* will consume the rest of the input stream and never see the second quote. Here is a solution (that incidentally needs the same Peek workaround as in Example 2).

start
  = Words+

Words
  = QuotedString
  / Text

QuotedString
  = '"' quote: NotQuote* '"' {return quote.join("")}

NotQuote
  = !'"' char: . {return char}

Text
  = text: NotQuote+ {return text.join("")}

Hawfinch answered 2/9, 2010 at 11:47 Comment(3)

Typo in the text of Example 3: should read "... and never see the second quote". It appears you can't edit answers to your own questions in Stackoverflow (unless you're logged in I assume). – Hawfinch 2/9, 2010 at 12:10

@NickEvans you just need a bit more rep to be able to edit things. anyway I've fixed that for you. – Williford 19/4, 2013 at 6:26

@NickEvans Please could you accept your answer? That would help people find this answer quicker on search engines. – Braunstein 16/5, 2015 at 22:6

For current versions of pegjs, you might try:

Example One

Input: "abcdef1234567ghijklmn8901opqrs"

Desired output: ["abcdef", "1234567", "ghijklmn", "8901", "opqrs"]

{
  /**
   * Deeply flatten an array.
   * @param  {Array} arr - array to flatten
   * @return {Array} - flattened array
   */
  const flatten = (arr) =>  Array.isArray(arr) ? arr.reduce((flat, elt) => flat.concat(Array.isArray(elt) ? flatten(elt) : elt), []) : arr
}

start = result:string {
  console.log(JSON.stringify(result))
  return result
}

string = head:chars tail:( digits chars? )* {
  return flatten([head,tail])
}

chars = [a-z]+ {
  return text()
}

digits = $[0-9]+ {
  return text()
}

Example 2

Should be easy to deduce from the answer above.

Example 3

Input: 'abcdefghijklmnop"qrstuvwxyz"abcdefg'

Desired output: ["abcdefghijklmnop", "qrstuvwxyz", "abcdefg"]

{
  /**
   * Deeply flatten an array.
   * @param  {Array} arr - array to flatten
   * @return {Array} - flattened array
   */
  const flatten = (arr) =>  Array.isArray(arr) ? arr.reduce((flat, elt) => flat.concat(Array.isArray(elt) ? flatten(elt) : elt), []) : arr
}

start = result:string {
  console.log(JSON.stringify(result))
  return result
}

string = head:chars tail:quote_chars* {
  return flatten([head,tail])
}

quote_chars = DQUOTE chars:chars {
  return chars
}

chars = [a-z]+ {
  return text()
}

DQUOTE = '"'

Albertson answered 26/1, 2017 at 21:28 Comment(1)

is digits = $[0-9]+ { return text() } equal to $[0-9]+? $ equal to text()? – Exorbitant 18/4, 2021 at 7:39

-2

Example 1

start
  = alnums

alnums
  = alnums:(alphas / numbers) {
    return alnums;
  }

alphas
  = alphas:$(alpha+)

numbers
  = numbers:$(number+)

number
  = [0-9]

alpha
  = [a-zA-Z]

Example 2

ignore

Example 3

> 'abcdefghijklmnop"qrstuvwxyz"abcdefg'.split('"')
[ 'abcdefghijklmnop',
  'qrstuvwxyz',
  'abcdefg' ]

Medievalism answered 9/1, 2014 at 6:41 Comment(0)

Example 1:

Example 2:

Example 3:

Example One

Example 2

Example 3

Example 1

Example 2

Example 3

Recommended topics

Hot tags