Using PEG Parser for BBCode Parsing: pegjs or ... what?

F

3

7

I have a bbcode -> html converter that responds to the change event in a textarea. Currently, this is done using a series of regular expressions, and there are a number of pathological cases. I've always wanted to sharpen the pencil on this grammar, but didn't want to get into yak shaving. But... recently I became aware of pegjs, which seems a pretty complete implementation of PEG parser generation. I have most of the grammar specified, but am now left wondering whether this is an appropriate use of a full-blown parser.

My specific questions are:

As my application relies on translating what I can to HTML and leaving the rest as raw text, does implementing bbcode using a parser that can fail on a syntax error make sense? For example: [url=/foo/bar]click me![/url] would certainly be expected to succeed once the closing bracket on the close tag is entered. But what would the user see in the meantime? With regex, I can just ignore non-matching stuff and treat it as normal text for preview purposes. With a formal grammar, I don't know whether this is possible because I am relying on creating the HTML from a parse tree and what fails a parse is ... what?
I am unclear where the transformations should be done. In a formal lex/yacc-based parser, I would have header files and symbols that denoted the node type. In pegjs, I get nested arrays with the node text. I can emit the translated code as an action of the pegjs generated parser, but it seems like a code smell to combine a parser and an emitter. However, if I call PEG.parse.parse(), I get back something like this:

[
       [
          "[",
          "img",
          "",
          [
             "/",
             "f",
             "o",
             "o",
             "/",
             "b",
             "a",
             "r"
          ],
          "",
          "]"
       ],
       [
          "[/",
          "img",
          "]"
       ]
    ]

given a grammar like:

document
   = (open_tag / close_tag / new_line / text)*

open_tag
   = ("[" tag_name "="? tag_data? tag_attributes? "]")


close_tag
   = ("[/" tag_name "]") 

text
   = non_tag+

non_tag
   = [\n\[\]]

new_line
   = ("\r\n" / "\n")

I'm abbreviating the grammar, of course, but you get the idea. So, if you notice, there is no contextual information in the array of arrays that tells me what kind of a node I have and I'm left to do the string comparisons again even thought the parser has already done this. I expect it's possible to define callbacks and use actions to run them during a parse, but there is scant information available on the Web about how one might do that.

Am I barking up the wrong tree? Should I fall back to regex scanning and forget about parsing?

Thanks

Fingerstall answered 29/6, 2012 at 20:13 Comment(3)

Steve, your question is very interesting (+1), I just want to do the same thing in an extension: parsing BBCode in a textarea (unfortunately this is the format a forum is still using), and create a "live" preview from the typed text using PEG.js or anything else except regular expressions. Did you manage to create the grammar for the BBCode parser? Couldn't you please share your solution via GitHub or anything else? That would help me a lot. Thanks very much in advance! – Adulterant 28/5, 2015 at 12:10

I used patorjk's bbcode parser. Works great and can be tweaked to your own needs if you have special tags. – Fingerstall 28/5, 2015 at 16:3

Thanks, I've already seen this library, but it uses regular expressions, which I wanted to avoid, because theoretically, parsing BBCode using regular expressions can not be done without faults (»»link) in some cases, e.g. when nesting them in each other, etc. That's why I wanted to do it using parsing expression grammar formalism. So didn't you try to make improvements to the grammar you started? :) Couldn't you share the basis of it? :) – Adulterant 28/5, 2015 at 18:26

H

3

Regarding your first question I have tosay that a live preview is going to be difficult. The problems you pointed out regarding that the parser won't understand that the input is "work in progress" are correct. Peg.js tells you at which point the error is, so maybe you could take that info and go a few words back and parse again or if an end tag is missing try adding it at the end.

The second part of your question is easier but your grammar won't look so nice afterwards. Basically what you do is put callbacks on every rule, so for example

text
   = text:non_tag+ {
     // we captured the text in an array and can manipulate it now
     return text.join("");
   }

At the moment you have to write these callbacks inline in your grammar. I'm doing a lot of this stuff at work right now, so I might make a pullrequest to peg.js to fix that. But I'm not sure when I find the time to do this.

Heriot answered 23/7, 2012 at 20:7 Comment(0)

T

4

First question (grammar for incomplete texts):

You can add

incomplete_tag = ("[" tag_name "="? tag_data? tag_attributes?)
//                         the closing bracket is omitted ---^

after open_tag and change document to include an incomplete tag at the end. The trick is that you provide the parser with all needed productions to always parse, but the valid ones come first. You then can ignore incomplete_tag during the live preview.

Second question (how to include actions):

You write socalled actions after expressions. An action is Javascript code enclosed by braces and are allowed after a pegjs expression, i. e. also in the middle of a production!

In practice actions like { return result.join("") } are almost always necessary because pegjs splits into single characters. Also complicated nested arrays can be returned. Therefore I usually write helper functions in the pegjs initializer at the head of the grammar to keep actions small. If you choose the function names carefully the action is self-documenting.

For an examle see PEG for Python style indentation. Disclaimer: this is an answer of mine.

Towny answered 24/7, 2012 at 13:12 Comment(0)

H

3

Regarding your first question I have tosay that a live preview is going to be difficult. The problems you pointed out regarding that the parser won't understand that the input is "work in progress" are correct. Peg.js tells you at which point the error is, so maybe you could take that info and go a few words back and parse again or if an end tag is missing try adding it at the end.

The second part of your question is easier but your grammar won't look so nice afterwards. Basically what you do is put callbacks on every rule, so for example

text
   = text:non_tag+ {
     // we captured the text in an array and can manipulate it now
     return text.join("");
   }

At the moment you have to write these callbacks inline in your grammar. I'm doing a lot of this stuff at work right now, so I might make a pullrequest to peg.js to fix that. But I'm not sure when I find the time to do this.

Heriot answered 23/7, 2012 at 20:7 Comment(0)

D

1

Try something like this replacement rule. You're on the right track; you just have to tell it to assemble the results.

text = result:non_tag+ { return result.join(''); }

Detoxify answered 24/7, 2012 at 11:58 Comment(0)

Recommended topics

Hot tags