Working example of wikitext-to-HTML in ANTLR 3

grammar wikitext;

options {
  //output = AST;
  //ASTLabelType = CommonTree;
  output = template;
  language = Java;
}

document: line (NL line?)*;

line: horizontal_line | list | heading | paragraph;

/* horizontal line */
horizontal_line: HRLINE;

/* lists */
list: unordered_list | ordered_list;
unordered_list: '*'+ content;
ordered_list: '#'+ content;

/* Headings */
heading: heading1 | heading2 | heading3 | heading4 | heading5 | heading6;
heading1: H1 plain H1;
heading2: H2 plain H2;
heading3: H3 plain H3;
heading4: H4 plain H4;
heading5: H5 plain H5;
heading6: H6 plain H6;

/* Paragraph */
paragraph: content;
content: (formatted | link)+;

/* links */
link: external_link | internal_link;
external_link: '[' external_link_uri ('|' external_link_title)? ']';
internal_link: '[[' internal_link_ref ('|' internal_link_title)? ']]' ;
external_link_uri: CHARACTER+;
external_link_title: plain;
internal_link_ref: plain;
internal_link_title: plain;

/* bold & italic */
formatted: bold_italic | bold | italic | plain;
bold_italic: BOLD_ITALIC plain BOLD_ITALIC;
bold: BOLD plain BOLD;
italic: ITALIC plain ITALIC;

/* Plain text */
plain: (CHARACTER | SPACE)+;

/**
 * LEXER RULES
 * --------------------------------------------------------------------------
 */
HRLINE: '---' '-'+;

H1: '=';
H2: '==';
H3: '===';
H4: '====';
H5: '=====';
H6: '======';

BOLD_ITALIC: '\'\'\'\'\'';
BOLD: '\'\'\'';
ITALIC: '\'\'';

NL: '\r'?'\n';

CHARACTER
  : '!' | '"' | '#' | '$' | '%' | '&' | '*' | '+' | ',' | '-' | '.' | '/'
  | ':' | ';' | '?' | '@' | '\\' | '^' | '_' | '`' | '~'
  | '0'..'9' | 'A'..'Z' |'a'..'z'
  | '\u0080'..'\u7fff'
  | '(' | ')' | '\'' | '<' | '>' | '=' | '[' | ']' | '|'
  ;

SPACE: ' ' | '\t';

Okay, after your EDIT, I have a couple of recommendations.

As I said in the comments, writing a grammar for such a language is nearly impossible, at least if you try to do it in one go. The only way I see this working is with multiple parsers, where the first parsing stage parses the wiki source very coarsely. For example, a table would be tokenized as: TABLE : '{|' .* '|}' and then you'd create another parser that parses this table properly. Doing it all in one parser will result in quite a few ambiguities in your parser rules, IMO.
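To make the "coarse first stage" idea a bit more concrete, here is a minimal sketch in plain Java (no ANTLR involved). It only illustrates the splitting step: the class name CoarseSplitter, the Chunk type and the split method are all hypothetical names made up for this example, and the regex is just a stand-in for a coarse TABLE token. It does not handle nested tables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical first "parsing stage": cut the wiki source into coarse chunks.
public class CoarseSplitter {

    // Hypothetical chunk type: either a raw table block or everything else.
    public static final class Chunk {
        public final boolean isTable;
        public final String text;

        Chunk(boolean isTable, String text) {
            this.isTable = isTable;
            this.text = text;
        }
    }

    // Reluctant match of '{|' ... '|}'; DOTALL lets '.' span line breaks.
    // Assumption: tables are not nested.
    private static final Pattern TABLE =
            Pattern.compile("\\{\\|.*?\\|\\}", Pattern.DOTALL);

    public static List<Chunk> split(String source) {
        List<Chunk> chunks = new ArrayList<Chunk>();
        Matcher m = TABLE.matcher(source);
        int last = 0;
        while (m.find()) {
            // text between the previous table (or the start) and this table
            if (m.start() > last) {
                chunks.add(new Chunk(false, source.substring(last, m.start())));
            }
            chunks.add(new Chunk(true, m.group()));
            last = m.end();
        }
        // trailing text after the last table
        if (last < source.length()) {
            chunks.add(new Chunk(false, source.substring(last)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String source = "intro text\n{|\n| cell 1\n| cell 2\n|}\nclosing text";
        for (Chunk c : split(source)) {
            // each table chunk would be handed to a dedicated table parser,
            // each text chunk to the "normal" wiki parser
            System.out.println((c.isTable ? "TABLE: " : "TEXT:  ")
                    + c.text.replace("\n", "\\n"));
        }
    }
}
```

Each chunk can then go to a parser that only has to understand one kind of construct, which is where the ambiguities should become manageable.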

About emitting HTML code, the "proper" way to do this is indeed with StringTemplate, but given the fact that you're rather new to ANTLR itself, I'd keep things simple. You could create a StringBuilder attribute in your parser class that would collect all your HTML code as you parse your source file. You can embed code in ANTLR rules by wrapping it with { and }.

Here's a quick demo:

grammar T;

@parser::members {

  // an attribute that is only available in your 
  // parser (so only in parser rules!)
  protected StringBuilder htmlBuilder = new StringBuilder();
}

// Parser rules
parse
  :  atom+ EOF
  ;

atom
  :  header
  |  Any    {htmlBuilder.append($Any.text);} // append the text from 'Any' token
  ;

header
  :  H3 h3Content H3 {htmlBuilder.append("<h3>" + $h3Content.text + "</h3>");}
  |  H2 h2Content H2 {htmlBuilder.append("<h2>" + $h2Content.text + "</h2>");}
  |  H1 h1Content H1 {htmlBuilder.append("<h1>" + $h1Content.text + "</h1>");}
  ;

h3Content : ~H3*; // match any token except H3, zero or more times
h2Content : ~H2*; //        "               H2          "
h1Content : ~H1*; //        "               H1          "

// Lexer rules    
H3 : '===';
H2 : '==';
H1 : '=';

// Fall through rule: if none of the above 
// lexer rules matched, this one will.
Any
  :  .
  ;

From that grammar, you generate a parser and lexer:

java -cp antlr-3.2.jar org.antlr.Tool T.g

and then create a little class to test your parser:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {

        // the source to be parsed
        String source = 
                "= header 1 =             \n"+
                "                         \n"+
                "some text here           \n"+
                "                         \n"+
                "=== header level 3 ===   \n"+
                "                         \n"+
                "and some more text         ";

        ANTLRStringStream in = new ANTLRStringStream(source);
        TLexer lexer = new TLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TParser parser = new TParser(tokens);

        // invoke the start-rule in your parser
        parser.parse();

        // print the contents of your parser's StringBuilder
        System.out.println(parser.htmlBuilder);
    }
}

and then compile all your source files:

javac -cp antlr-3.2.jar *.java

and finally, run your main class:

// *nix & MacOS
java -cp .:antlr-3.2.jar Main

// Windows
java -cp .;antlr-3.2.jar Main

which will print the following to the console:

<h1> header 1 </h1>             

some text here           

<h3> header level 3 </h3>   

and some more text
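One thing the demo glosses over: the text is appended to the StringBuilder as-is, so a < or & in the wiki source would end up unescaped in the HTML. A small helper along these lines could take care of that. Note that HtmlEscaper and escape are hypothetical names of my own, not something ANTLR provides, and this is only a sketch covering the four most common characters.

```java
// Hypothetical helper (not part of ANTLR): escapes the characters that
// have a special meaning in HTML before they are appended to the output.
public class HtmlEscaper {

    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;");  break;
                case '<': out.append("&lt;");   break;
                case '>': out.append("&gt;");   break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & c")); // prints: a &lt; b &amp; c
    }
}
```

With that class on the classpath, the action in the atom rule could become {htmlBuilder.append(HtmlEscaper.escape($Any.text));}, and likewise for the header contents.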

But, again, if you are free to choose a different language to parse, I'd do that and forget about parsing this horrible Wiki-thing.

Anyway, whatever you do: best of luck!
