I'm trying to parse some information out of largely free-form text. I attempted an implementation in FParsec, but I haven't used it before and I'm not sure if I'm doing it wrong, or even if it is well-suited to this particular problem.
Problem description
I want to parse out the contents of a particular set of Liquid tags from a markdown document ("examplecode" and "requiredcode" tags). The markdown will be mainly free-form text with the occasional block within Liquid tags, for example:
Some free form text.
Possibly lots of lines. Maybe `code` stuff.
{% examplecode opt-lang-tag %}
ABC
DEF
{% endexamplecode %}
More text. Possibly multilines.
{% othertag %}
can ignore this tag
{% endothertag %}
{% requiredcode %}
GHI
{% endrequiredcode %}
In this case I need to parse out [ "ABC\nDEF"; "GHI" ].
The parsing logic I'm after can be expressed imperatively. Loop through each line: if we find a start tag we're interested in, take lines until we match the closing tag and add those lines to the list of results; otherwise skip lines until the next start tag. Repeat.
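This can be done with a loop or fold, or with a regular expression (in single-line mode, so that . also matches newlines, and with lazy quantifiers so that one match cannot span several blocks):
\{%\s*(examplecode|requiredcode).*?%\}(.*?)\{%\s*end\1\s*%\}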
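For reference, the loop described above might be sketched as a fold in plain F#. This is only a sketch and assumes every tag sits on a line of its own; the names wanted, tagLine, TagLine and extractBlocks are made up for illustration and do not come from any library.

```fsharp
open System
open System.Text.RegularExpressions

// Tag names of interest, taken from the problem description.
let wanted = set [ "examplecode"; "requiredcode" ]

// Assumption: a tag occupies a whole line. Captures the first word of the tag.
let tagLine = Regex(@"^\s*\{%\s*(\S+).*?%\}\s*$")

// Yields Some name when the line is a tag line such as "{% examplecode lang %}".
let (|TagLine|_|) (line : string) =
    let m = tagLine.Match line
    if m.Success then Some m.Groups.[1].Value else None

// Fold state: finished blocks (newest first) and, while inside a wanted block,
// its tag name plus the lines collected so far (newest first).
let extractBlocks (text : string) =
    let step (found, current) (line : string) =
        match current, line with
        | Some (name, acc), TagLine t when t = "end" + name ->
            // Matching end tag: emit the collected lines in their original order.
            (String.concat "\n" (List.rev acc) :: found, None)
        | Some (name, acc), _ ->
            // Inside a wanted block: keep the line, even if it contains "{%".
            (found, Some (name, line :: acc))
        | None, TagLine t when wanted.Contains t ->
            // Start tag of interest: begin collecting.
            (found, Some (t, []))
        | None, _ ->
            // Free-form text or a tag we do not care about: skip.
            (found, current)
    text.Split('\n')
    |> Array.map (fun l -> l.TrimEnd('\r'))
    |> Array.fold step ([], None)
    |> fst
    |> List.rev
```

A block that is never closed is silently dropped here, which is one place where a real parser's error reporting would differ.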
My FParsec attempt
I found it difficult to express the logic above in FParsec. I wanted to write something like between s t (everythingUntil t), but I don't know how to implement that without everythingUntil consuming the end token, causing between to fail.
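To make the idea concrete, here is a rough sketch of what such a combinator could look like. It relies on charsTillString with its skipString argument set to false, which stops before the end string instead of consuming it. everythingUntil, everythingUntilP and betweenStrings are hypothetical names, not FParsec functions, and the user state is assumed to be unit.

```fsharp
open System
open FParsec

// Hypothetical helper: reads characters up to, but not including,
// the first occurrence of the string t.
let everythingUntil (t : string) : Parser<string, unit> =
    charsTillString t false Int32.MaxValue

// Since t is left unconsumed, the closing parser of between still finds it.
let betweenStrings (s : string) (t : string) : Parser<string, unit> =
    between (pstring s) (pstring t) (everythingUntil t)

// The same idea for an arbitrary end parser: followedBy only peeks at the
// input, so the end marker stays in the stream for whoever needs it next.
let everythingUntilP (pEnd : Parser<'a, unit>) : Parser<string, unit> =
    manyCharsTill anyChar (followedBy pEnd)
```

For example, run (betweenStrings "{%" "%}") "{% examplecode %}" would be one way to try it out on a single tag.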
I ended up with the following, which doesn't handle nested occurrences of "{%", but seems to pass the main test cases I care about:
open System
open FParsec

// Assumed definition (not shown in the original snippet): effectively no length limit.
let maxInt = Int32.MaxValue

let trimStr (s : string) = s.Trim()
let betweenStr s t = between (pstring s) (pstring t)
// Read (or skip) everything up to, but not including, the string s.
let allTill s = charsTillString s false maxInt
let skipAllTill s = skipCharsTillString s false maxInt
let word : Parser<string, unit> = many1Satisfy (not << Char.IsWhiteSpace)

type LiquidTag = private LiquidTag of name : string * contents : string
let makeTag n c = LiquidTag (n, trimStr c)

let liquidTag =
    // "{% name anything %}" yields name
    let pStartTag = betweenStr "{%" "%}" (spaces >>. word .>> spaces .>> skipAllTill "%}")
    // "{% endname %}"
    let pEndTag tagName = betweenStr "{%" "%}" (spaces >>. pstring ("end" + tagName) .>> spaces)
    // Contents stop at the next "{%", which is why nested "{%" breaks parsing.
    let tagContents = allTill "{%"
    pStartTag >>= fun name ->
        tagContents
        .>> pEndTag name
        |>> makeTag name

let tags = many (skipAllTill "{%" >>. liquidTag)
I can then filter tags to only include the ones I'm interested in.
This does a lot more than a basic implementation (such as a regex) does: it gives descriptive error reporting and validates the input format more strictly (which is both good and bad).
One consequence of the stricter format is that parsing fails on nested "{%" substrings within tags. I'm not sure how I'd adjust it to handle this case (should give [ "ABC {% DEF " ]):
{% examplecode %}
ABC {% DEF
{% endexamplecode %}
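One direction that might handle this case, sketched here under assumptions rather than as a tested solution: read the contents one character at a time and stop only when the complete end tag parses, so that a stray "{%" counts as ordinary text. All parser names below are made up for this sketch, tag names are assumed to consist of letters, digits and underscores, and the result is a plain (name, contents) tuple rather than the LiquidTag type.

```fsharp
open System
open FParsec

// Tag delimiters with optional whitespace inside them.
let pOpen : Parser<unit, unit> = skipString "{%" >>. spaces
let pClose : Parser<unit, unit> = spaces >>. skipString "%}"

// Assumption: tag names are made of letters, digits and underscores.
let pName : Parser<string, unit> =
    many1Satisfy (fun c -> Char.IsLetterOrDigit c || c = '_')

// "{% name anything %}" yields name; the rest of the tag is skipped.
let pStartTag : Parser<string, unit> =
    pOpen >>. pName .>> skipCharsTillString "%}" true Int32.MaxValue

// "{% endname %}" for one specific name.
let pEndTag (name : string) : Parser<unit, unit> =
    pOpen >>. skipString ("end" + name) .>> pClose

// Contents run until the full end tag matches. attempt makes a partial match
// such as "{% DEF" backtrack, so it is then consumed as plain characters.
let pTag : Parser<string * string, unit> =
    pStartTag >>= fun name ->
        manyCharsTill anyChar (attempt (pEndTag name))
        |>> fun contents -> (name, contents.Trim())

// Skip free-form text until a complete tag parses, and return that tag.
let pNextTag : Parser<string * string, unit> =
    manyCharsTillApply anyChar (attempt pTag) (fun _ tag -> tag)

// All tags in the document; trailing text after the last tag is ignored.
let pTags : Parser<(string * string) list, unit> =
    many (attempt pNextTag)
```

The result of run pTags text can then be filtered by tag name as before. Retrying the tag parser at every character is simple but can backtrack a lot on large inputs, and an unclosed tag is skipped silently instead of being reported, so this gives up some of the strictness mentioned above.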
Question
Is there a way to more closely express the logic described in the "Problem description" section in FParsec, or does the free-form nature of the input make FParsec less suited to this than a more basic loop or regex?
(I'm also interested in ways to allow nested "{%" strings in tags, and improvements to my FParsec attempt. I'm happy to split that out into other questions as required.)