Parsing BBCode with XSLT 2.0

Asked 8/12, 2009 at 22:45 Answered 8/12, 2009 at 23:25

I need help finding a viable way to convert BBCode to HTML. This is how far I have come, but it fails when BBCode tags are nested inside one another.

Src:

 [quote id="ohoh81"]asdasda
     [quote id="ohoh80"]adsad
         [quote id="ohoh79"]asdad[/quote]
     [/quote]
 [/quote]

Code:

<xsl:variable name="rules">
    <code check="&#xD;" >&lt;br/&gt;</code>
    <code check="\&#91;(quote)(.*)\&#93;" >&lt;span class=&#34;quote&#34;&gt;</code>
</xsl:variable>

<xsl:template match="text()" mode="BBCODE">
  <xsl:call-template name="REPLACE_EM_ALL">
    <xsl:with-param name="text" select="." />
    <xsl:with-param name="pos" select="number(1)" />
  </xsl:call-template>
</xsl:template>

<xsl:template name="REPLACE_EM_ALL">
  <xsl:param name="text" />
  <xsl:param name="pos" />
  <xsl:variable name="newText" select="replace($text, ($rules/code[$pos]/@check), ($rules/code[$pos]))" />
  <xsl:choose>
    <xsl:when test="$rules/code[$pos +1]">
      <xsl:call-template name="REPLACE_EM_ALL">
        <xsl:with-param name="text" select="$newText" />
        <xsl:with-param name="pos" select="$pos+1" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of disable-output-escaping="yes" select="$newText" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

Fico answered 8/12, 2009 at 22:45 Comment(0)

I think a more viable approach would be to repeatedly match and replace (via regex) pairs of BBCode tags until there are no more matches. For example, for [quote] and [url]:

<xsl:function name="my:bbcode-to-xhtml" as="node()*">
  <xsl:param name="bbcode" as="xs:string"/>
  <!-- flags="s" lets "." match newlines; the greedy (.*) pairs the first opening tag with the last closing tag -->
  <xsl:analyze-string select="$bbcode" regex="(\[quote\](.*)\[/quote\])|(\[url=(.*?)\](.*)\[/url\])" flags="s">
    <xsl:matching-substring>
      <xsl:choose>
        <xsl:when test="regex-group(1)"> <!-- [quote] -->
          <span class="quote">
            <!-- xsl:sequence keeps nested elements; xsl:value-of would flatten them to text -->
            <xsl:sequence select="my:bbcode-to-xhtml(regex-group(2))"/>
          </span>
        </xsl:when>
        <xsl:when test="regex-group(3)"> <!-- [url] -->
          <!-- curly braces make href an attribute value template, so the captured URL is evaluated -->
          <a href="{regex-group(4)}">
            <xsl:sequence select="my:bbcode-to-xhtml(regex-group(5))"/>
          </a>
        </xsl:when>
      </xsl:choose>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:function>
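
To try the function against the sample from the question, it needs namespace declarations and a calling template, and the pattern has to accept the id="..." parameter on [quote]. The following is only a minimal sketch of one way to do that, not a tested production stylesheet. The namespace URI urn:example:bbcode, the post input element, and the data-id output attribute are assumptions made for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:bbcode"
    exclude-result-prefixes="xs my">

  <!-- Assumption: [quote] may carry an optional id="..." parameter,
       as in the sample from the question. -->
  <xsl:function name="my:bbcode-to-xhtml" as="node()*">
    <xsl:param name="bbcode" as="xs:string"/>
    <!-- Group 2 = id value (empty if absent), group 3 = quote body. -->
    <xsl:analyze-string select="$bbcode" flags="s"
        regex="\[quote(\s+id=&quot;([^&quot;\]]*)&quot;)?\](.*)\[/quote\]">
      <xsl:matching-substring>
        <span class="quote">
          <!-- Hypothetical attribute name; emitted only when an id was captured. -->
          <xsl:if test="regex-group(2) != ''">
            <xsl:attribute name="data-id" select="regex-group(2)"/>
          </xsl:if>
          <!-- Recurse into the body so inner [quote] pairs are converted too. -->
          <xsl:sequence select="my:bbcode-to-xhtml(regex-group(3))"/>
        </span>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Text outside any matched pair is copied through unchanged. -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:function>

  <!-- Assumption: the BBCode text lives in a <post> element. -->
  <xsl:template match="post">
    <div class="post">
      <xsl:sequence select="my:bbcode-to-xhtml(string(.))"/>
    </div>
  </xsl:template>
</xsl:stylesheet>
```

One limitation to keep in mind: because (.*) is greedy, the first opening tag is paired with the last closing tag. That suits strictly nested quotes like the sample, but two sibling quotes at the same level would be swallowed into a single match.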

Situla answered 8/12, 2009 at 23:25 Comment(8)

I thought it was commonly agreed here not to recommend regular expressions to parse structured languages. ;) – Slushy 9/12, 2009 at 10:7

It will in practice only match a couple of times, so this will be fine. Thanks again, Pavel. – Fico 9/12, 2009 at 11:32

The regex-group should be regex-group(1) with the above regex. Works in prod btw. – Fico 9/12, 2009 at 15:44

@Tomalak: I generally recommend against parsing XML with regex simply because it's very hard to get it right for all corner cases (DOCTYPE, character entities, CDATA, correct invalid input handling, and so on). BBcode is much simpler - in fact, I strongly suspect it was invented by a lazy dev who didn't want to parse something XML-like, so came up with a scheme that's easier to deal with. Besides, analyze-string seems to be specifically geared at parsing text streams (that's why it repeatedly applies the regex, after all). – Situla 9/12, 2009 at 16:51

Sure, but as soon as it gets to attributes (BBCode can have them, AFAIK) or other things that break nesting for a regex, it will fail just as badly as it will fail for XML/HTML/etc. Admittedly: As long as it does not get any more complex than the OP shows in his example (no attributes, no nested comments or HTML), and all BBcode tags are guaranteed to be nested correctly, a regex based approach can work. But it will still be the weak point of the whole construct. – Slushy 9/12, 2009 at 18:40

Btw, Pavel: my example shows a recursive template that applies many different regexes. Do you have a recommended easy way of doing this with your function, or should I just write a better general regex and derive each tag's output from its match? – Fico 9/12, 2009 at 19:46

BBcode isn't guaranteed to nest correctly, but the traditional way of handling it is to process the pairs that match, and leave the unmatched ones be, which is what will happen here. I haven't seen attributes proper either, though you can get something like [url=http://...]...[/url], or same thing for [quote] - but this is also trivial to deal with as there are no quotes, and no escape chars. – Situla 9/12, 2009 at 20:30

@Sveisvei: I've edited the answer to demonstrate how mixed [quote] and [url] can be nested, and to show how to extract the parameter in URL and use it. – Situla 9/12, 2009 at 20:33

This is probably a bad idea because XSLT is designed to handle well-formed XML, not arbitrary text. I'd suggest you preprocess the BBCode first to replace the left and right brackets with < and >, do whatever else you need to to make it well-formed XML, and then process it with XSL.
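
As a rough illustration of the second half of that suggestion: once the brackets have been rewritten outside XSLT and the result is well-formed XML, the conversion reduces to ordinary template rules. This is only a sketch under assumptions. The element names post, quote and url, and the href attribute, are hypothetical choices for the preprocessed format, and the preprocessing step itself is not shown.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed input after preprocessing, for example:
       <post>
         <quote id="ohoh81">asdasda
           <quote id="ohoh80">adsad</quote>
         </quote>
       </post> -->

  <xsl:template match="post">
    <div class="post"><xsl:apply-templates/></div>
  </xsl:template>

  <!-- Nested quotes are handled by the recursive apply-templates call. -->
  <xsl:template match="quote">
    <span class="quote"><xsl:apply-templates/></span>
  </xsl:template>

  <!-- Hypothetical mapping of [url=...] to <url href="...">. -->
  <xsl:template match="url">
    <a href="{@href}"><xsl:apply-templates/></a>
  </xsl:template>

  <!-- Text nodes are copied by the built-in template rules. -->
</xsl:stylesheet>
```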

Hit answered 8/12, 2009 at 23:9 Comment(5)

XSLT 2.0 has the xsl:analyze-string instruction, which is pretty awesome for processing non-XML text. – Situla 8/12, 2009 at 23:14

True, but it's still not intended to process a complete non-XML input file. I certainly wouldn't try to do this task purely in XSLT because a general parser for all of BBCode in XSL would be very complex and hard to maintain. BBCode is close enough to XML structure that it would be far easier to represent it as XML and then use the full power of XSLT to convert to XHTML. – Hit 8/12, 2009 at 23:40

@Jim: +1. XSLT is a wonderful tool for XML transformation, it's not an all-purpose language... – Rebirth 9/12, 2009 at 9:13

Of course XSLT is a domain-specific language, and it is meant for data transformation (originally only XML). But XSLT 2.0 left that path and expanded to allow input of multiple XML and (Unicode) text files with unparsed-text() and unparsed-text-available(). So it's still a data-transformation language, but definitely not meant to be bound to XML alone (luckily so, as many problem domains include XML and text input). – Eurhythmic 23/8, 2010 at 21:40

PS: a variant of your suggestion, pre-processing it with XSLT and then post-processing it with XSLT in a micro-pipeline, is a common coding pattern in the XSLT world. – Eurhythmic 23/8, 2010 at 21:42
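
For reference, the text-input route mentioned in the comments above might look roughly like this. It is a minimal sketch, assuming the BBCode sits in a plain-text file; the input-uri parameter, its default value post.txt, and the main template name are hypothetical.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: URI of a plain-text file holding BBCode.
       A relative URI is resolved against the stylesheet's base URI. -->
  <xsl:param name="input-uri" select="'post.txt'"/>

  <!-- Named entry point, so no XML source document is required. -->
  <xsl:template name="main">
    <div class="post">
      <xsl:choose>
        <xsl:when test="unparsed-text-available($input-uri, 'UTF-8')">
          <!-- The raw text could be passed to a BBCode conversion function
               here instead of being copied as-is. -->
          <xsl:value-of select="unparsed-text($input-uri, 'UTF-8')"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:message>Cannot read <xsl:value-of select="$input-uri"/></xsl:message>
        </xsl:otherwise>
      </xsl:choose>
    </div>
  </xsl:template>
</xsl:stylesheet>
```

With Saxon, for example, a named template like this can be started with the -it:main command-line option.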
