Ragel works fine. You just need to be careful about what you're matching. Your question uses both [[tag]]
and {{tag}}
, but your example uses [[tag]]
, so I figure that's what you're trying to treat as special.
What you want to do is eat text until you hit an open-bracket. If that bracket is followed by another bracket, then it's time to start eating lowercase characters till you hit a close-bracket. Since the text in the tag cannot include any bracket, you know that the only non-error character that can follow that close-bracket is another close-bracket. At that point, you're back where you started.
Well, that's a verbatim description of this machine:
tag = '[[' lower+ ']]';
main := (
(any - '[')* # eat text
('[' ^'[' | tag) # try to eat a tag
)*;
The tricky part is, where do you call your actions? I don't claim to have the best answer to that, but here's what I came up with:
static char *text_start;
%%{
machine parser;
action MarkStart { text_start = fpc; }
action PrintTextNode {
int text_len = fpc - text_start;
if (text_len > 0) {
printf("TEXT(%.*s)\n", text_len, text_start);
}
}
action PrintTagNode {
int text_len = fpc - text_start - 1; /* drop closing bracket */
printf("TAG(%.*s)\n", text_len, text_start);
}
tag = '[[' (lower+ >MarkStart) ']]' @PrintTagNode;
main := (
(any - '[')* >MarkStart %PrintTextNode
('[' ^'[' %PrintTextNode | tag) >MarkStart
)* @eof(PrintTextNode);
}%%
There are a few non-obvious things:
- The
eof
action is needed because %PrintTextNode
is only ever invoked on leaving a machine. If the input ends with normal text, there will be no input to make it leave that state. Because it will also be called when the input ends with a tag, and there is no final, unprinted text node, PrintTextNode
tests that it has some text to print.
- The
%PrintTextNode
action nestled in after the ^'['
is needed because, though we marked the start when we hit the [
, after we hit a non-[
, we'll start trying to parse anything again and remark the start point. We need to flush those two characters before that happens, hence that action invocation.
The full parser follows. I did it in C because that's what I know, but you should be able to turn it into whatever language you need pretty readily:
/* ragel so_tag.rl && gcc so_tag.c -o so_tag */
#include <stdio.h>
#include <string.h>
static char *text_start;
%%{
machine parser;
action MarkStart { text_start = fpc; }
action PrintTextNode {
int text_len = fpc - text_start;
if (text_len > 0) {
printf("TEXT(%.*s)\n", text_len, text_start);
}
}
action PrintTagNode {
int text_len = fpc - text_start - 1; /* drop closing bracket */
printf("TAG(%.*s)\n", text_len, text_start);
}
tag = '[[' (lower+ >MarkStart) ']]' @PrintTagNode;
main := (
(any - '[')* >MarkStart %PrintTextNode
('[' ^'[' %PrintTextNode | tag) >MarkStart
)* @eof(PrintTextNode);
}%%
%% write data;
int
main(void) {
char buffer[4096];
int cs;
char *p = NULL;
char *pe = NULL;
char *eof = NULL;
%% write init;
do {
size_t nread = fread(buffer, 1, sizeof(buffer), stdin);
p = buffer;
pe = p + nread;
if (nread < sizeof(buffer) && feof(stdin)) eof = pe;
%% write exec;
if (eof || cs == %%{ write error; }%%) break;
} while (1);
return 0;
}
Here's some test input:
[[header]]
<html>
<head><title>title</title></head>
<body>
<h1>[[headertext]]</h1>
<p>I am feeling very [[emotion]].</p>
<p>I like brackets: [ is cool. ] is cool. [] are cool. But [[tag]] is special.</p>
</body>
</html>
[[footer]]
And here's the output from the parser:
TAG(header)
TEXT(
<html>
<head><title>title</title></head>
<body>
<h1>)
TAG(headertext)
TEXT(</h1>
<p>I am feeling very )
TAG(emotion)
TEXT(.</p>
<p>I like brackets: )
TEXT([ )
TEXT(is cool. ] is cool. )
TEXT([])
TEXT( are cool. But )
TAG(tag)
TEXT( is special.</p>
</body>
</html>
)
TAG(footer)
TEXT(
)
The final text node contains only the newline at the end of the file.