How to parse nested BB Code with parameters
I'd like to work on a BB Code filter for a PHP website (I'm using CakePHP, so it would be a BB Code helper). I have the following requirements:

  • BB Code can be nested. So something like this is valid

    [block]  
        [block]  
        [/block]  
        [block]  
            [block]  
            [/block]  
        [/block]  
    [/block]  
    
  • BB Code tags can have 0 or more parameters.

    Example:

    [video: url="url", width="500", height="500"]Title[/video]
    
  • BB Code might have multiple behaviours

    Let's say [url]text[/url] would be transformed to [url:url="text"]text[/url] or the video BB Code would be able to choose between YouTube, Dailymotion, etc.
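For the "multiple behaviours" requirement, one possible approach is to choose the behaviour from the URL's host. The following is only a minimal sketch: the function name `detectVideoProvider` and the host map are hypothetical and do not come from any library.

```php
<?php
// Hypothetical helper: decide which video provider a URL belongs to.
// The host map below is an assumption made for illustration only.
function detectVideoProvider(string $url): ?string
{
    $providers = [
        'youtube.com'     => 'youtube',
        'youtu.be'        => 'youtube',
        'dailymotion.com' => 'dailymotion',
    ];

    // parse_url() returns null when there is no host and false on malformed input.
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null;
    }

    // Normalise case and drop a leading "www." before the lookup.
    $host = preg_replace('/^www\./', '', strtolower($host));

    return $providers[$host] ?? null;
}
```

The video tag handler could then switch on the returned provider name and fall back to a plain link when the result is null.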

I've already done something with regex, but my biggest problem was matching parameters. In fact, I got nested BB Code and BB Code with 0 parameters to work. But when I added a regex match for parameters, it no longer matched nested BB Code correctly:

"\[($tag)(=.*)\"\](.*)\[\/\1\]" (the real pattern used the non-greedy matcher rather than .*)

I don't have the complete regex with me right now, but it looked something like the one above.
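One way to keep parameter matching from interfering with nesting is to parse a single opening tag on its own, separately from finding its closing tag. Here is a rough sketch for the `[video: url="url", width="500"]` syntax from the requirements. The function names `parseTagParams` and `parseOpeningTag` are made up for this example, and the sketch assumes values are double-quoted and contain neither `"` nor `]`.

```php
<?php
// Hypothetical helper: turn 'url="u", width="500"' into an associative array.
function parseTagParams(string $paramString): array
{
    $params = [];
    // Each parameter is name="value"; values may not contain a double quote.
    preg_match_all('/(\w+)\s*=\s*"([^"]*)"/', $paramString, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $params[$m[1]] = $m[2];
    }
    return $params;
}

// Hypothetical helper: parse one opening tag such as [block] or [video: url="u"].
// Returns null when the string is not an opening tag in this syntax.
function parseOpeningTag(string $tag): ?array
{
    // Tag name, then an optional parameter list introduced by a colon.
    if (!preg_match('/^\[(\w+)(?::\s*([^\]]*))?\]$/', $tag, $m)) {
        return null;
    }
    return ['name' => $m[1], 'params' => parseTagParams($m[2] ?? '')];
}
```

Because the parameter pattern never looks past the closing `]` of the opening tag, it cannot swallow nested tags the way a `.*` inside a whole-element regex can.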

Is there a way to match BB Code with regex or something else?

The only thing I can think of is to use the visitor pattern and to split my text on each possible tag. This way, I would have a bit more control over my text parsing, and I could probably validate my document, so that if the input text doesn't contain valid BB Code, I could notify the user with an error before saving anything.

I would use SableCC to create my text parser.

Shelled answered 28/1, 2009 at 19:21 Comment(0)

There are both PECL and PEAR BBCode parsing libraries. Software's hard enough without reinventing years of work on your own.

If neither of those is an option, I'd concentrate on turning the BBCode into a valid XML string, and then using your favorite XML parsing routine on that. Very, very rough idea here, but

  1. Run the code through htmlspecialchars to escape any entities that need escaping

  2. Transform all [ and ] characters into < and > respectively

  3. Don't forget to account for the colon in cases like [tagname:

If the BBCode was nested properly, you should be all set to pass this string into an XML parsing object (SimpleXML, DOMDocument, etc.)
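The three steps above could be sketched roughly as follows. This is an illustration under assumptions, not a vetted sanitizer: the function names and the `$allowed` whitelist are hypothetical, parameters are assumed to use the `name="value"` form with no `"` or `]` inside values, and tags outside the whitelist (for example `[script]`) are deliberately left as literal text.

```php
<?php
// Sketch only: convert whitelisted BBCode tags to XML elements.
// $allowed is a hypothetical whitelist; any other tag stays literal text.
function bbcodeToXml(string $bbcode, array $allowed = ['block', 'video', 'url']): string
{
    // Step 1: escape &, < and > so user text cannot inject markup.
    // Quotes are left alone because the parameter syntax relies on them.
    $escaped = htmlspecialchars($bbcode, ENT_NOQUOTES | ENT_XML1, 'UTF-8');

    // Steps 2 and 3: rewrite [tag], [tag: a="b"] and [/tag], dropping the colon.
    $body = preg_replace_callback(
        '/\[(\/?)(\w+)(?::\s*([^\]]*))?\]/',
        function (array $m) use ($allowed): string {
            if (!in_array($m[2], $allowed, true)) {
                return $m[0]; // unknown tag such as [script]: keep as plain text
            }
            if ($m[1] === '/') {
                return "</{$m[2]}>";
            }
            // Collect name="value" pairs; a repeated name keeps its last value.
            $attrs = [];
            preg_match_all('/([a-zA-Z_]\w*)\s*=\s*"([^"]*)"/', $m[3] ?? '', $pairs, PREG_SET_ORDER);
            foreach ($pairs as $p) {
                $attrs[$p[1]] = $p[2];
            }
            $xml = "<{$m[2]}";
            foreach ($attrs as $name => $value) {
                $xml .= " {$name}=\"{$value}\"";
            }
            return $xml . '>';
        },
        $escaped
    );

    // A single root element is required for well-formed XML.
    return "<root>{$body}</root>";
}

// Returns null when the BBCode was not nested properly (the XML is not well-formed).
function parseBBCodeAsXml(string $bbcode): ?SimpleXMLElement
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
    $xml = simplexml_load_string(bbcodeToXml($bbcode));
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $xml === false ? null : $xml;
}
```

A null result from `parseBBCodeAsXml` is the signal that the input was badly nested, which is the point where the user could be shown an error before anything is saved.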

Careworn answered 28/1, 2009 at 21:6 Comment(3)
That's a horrible idea. What would [script] ... [/script] do? - Bambara
Yeah, that's pretty awful if you're planning on outputting HTML back. What I wrote was assuming you're parsing the BBCode to pull out information. If you're using anything but official BBCode parsers (mentioned in the first paragraph) you're bound to leave yourself open to an XSS attack. - Careworn
@AlanStorm I wouldn't say that. Parsing BBCode as XML-like markup is actually a good idea and is less prone to XSS attacks, unless you aren't actually parsing the content and are just replacing tags with HTML tags, which isn't really the point here. You don't need an XML parser to replace '[' with '<'. But extending BBCode through XML parsers makes a lot of sense. It lets you define strict rules on what to do when finding an object, and then you can output it back to HTML, and anything that isn't safe can be easily filtered within your "pseudo DOM" objects. - Preterhuman

There are several existing libraries for parsing BBCode; it may be easier to look into those than to roll your own.

Here are a couple; I'm sure there are more if you look around:

  • PECL bbcode
  • PEAR HTML_BBCodeParser

Mediterranean answered 28/1, 2009 at 19:36 Comment(0)

Most BB Code parsers are regex-based, were written for PHP 4, and either produce errors on PHP 5.2+ or don't work at all.

PECL bbcode and PEAR HTML_BBCodeParser don't appear to be maintained anymore (as of late 2012) and aren't easily installed on the shared hosting setup I have to work with.

StringParser_BBCode works with some minor tweaks for 5.2+ but the method for adding new tags is clumsy, and it was last updated in 2008.

Buried on the 4th page of a Bing search, I found jBBCode, which appears new, requires PHP 5.3, and is under the MIT License. I have yet to try building custom tags, but so far it is the only one I've tried that works out of the box on a shared hosting account with PHP 5.3.

Jaquesdalcroze answered 18/10, 2012 at 4:46 Comment(1)
This post is quite old and, to be honest, I'm amazed that it is still relevant. If I had to implement it again, I wouldn't do it using regexes. BBCode can be quite similar to HTML, since it's a markup language using brackets instead of < and >. I'd probably adapt an XML parser to check for [ and ] instead. This way you get all the benefits of XML inside BBCode without much trouble. While parsing the BBCode, you can do almost anything. - Preterhuman

We recently looked at going the BB Code route and decided on using htmlpurifier instead.

This decision was based in part on the (admittedly biased) comparisons between various methods listed by the htmlpurifier group here and their discussion of BB Code here.

Slideaction answered 28/1, 2009 at 19:32 Comment(1)
Ah, thank you, I'll probably include HTML Purifier. But because I'm not really a fan of things like FCKeditor, I'd say it will mostly be used to purify the HTML output. It looks very nice, though. - Preterhuman

Use preg_split() with the PREG_SPLIT_DELIM_CAPTURE flag to split the source text into tags and non-tags. Then iterate over the tokens, keeping a stack of open blocks: when you see an opening tag, push it onto an array; when you see a closing tag, pop elements from the end of the array until you reach the opening tag that matches it.
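A minimal sketch of that idea follows. The function names `tokenizeBBCode` and `findNestingErrors` are invented for the example, the tag pattern is an assumption (a word-character name, optionally followed by anything up to the next `]`), and a closing tag with no open counterpart is reported rather than emptying the stack.

```php
<?php
// Split the text into tag tokens and plain-text tokens.
function tokenizeBBCode(string $text): array
{
    // The capturing group makes preg_split() return the delimiters (the tags) too.
    return preg_split(
        '/(\[\/?\w+[^\]]*\])/',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

// Walk the tokens with a stack of open tag names and collect nesting problems.
function findNestingErrors(string $text): array
{
    $stack = [];  // names of the currently open tags
    $errors = [];
    foreach (tokenizeBBCode($text) as $token) {
        if (preg_match('/^\[\/(\w+)\]$/', $token, $m)) {
            if (!in_array($m[1], $stack, true)) {
                $errors[] = "unexpected closing tag [/{$m[1]}]";
                continue;
            }
            // Pop until the matching opening tag is found.
            while (($open = array_pop($stack)) !== $m[1]) {
                $errors[] = "unclosed tag [{$open}]";
            }
        } elseif (preg_match('/^\[(\w+)[^\]]*\]$/', $token, $m)) {
            $stack[] = $m[1];
        }
        // Anything else is plain text and needs no bookkeeping here.
    }
    // Whatever is still open at the end was never closed.
    foreach (array_reverse($stack) as $open) {
        $errors[] = "unclosed tag [{$open}]";
    }
    return $errors;
}
```

An empty array from `findNestingErrors` means every tag was closed in order. The same loop could build a tree or emit HTML instead of only collecting errors.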

Ravens answered 9/3, 2010 at 20:47 Comment(0)
