I have asked myself the same question. Why is there no standard combinator for nested parsers like how you describe? My default mode is to trust the package author, especially when that author also co-wrote "Real World Haskell". If such an obvious capability is missing, perhaps it is by design and I should look for a better way. However, I've managed to convince myself that such a convenient combinator is mostly harmless. Useful whenever an all-or-nothing type parser is appropriate for the inner parse.
Implementation
import Data.Attoparsec.Text
import qualified Data.Text as T
import Data.Text(Text)
import Control.Applicative
I've divided the required functionality into two parsers. The first, constP
, performs an "in place" parse of some given text. It substitutes the constant parser's fail with empty
(from Alternative), but otherwise has no other side effects.
constP :: Parser a -> Text -> Parser a
constP p t = case parseOnly p t of
Left _ -> empty
Right a -> return a
The second part comes from parseOf
, which performs the constant, inner parse based on the result of the outer parse. The empty
alternative here allows a failed parse to return without consuming any input.
parseOf :: Parser Text -> Parser a -> Parser a
parseOf ptxt pa = bothParse <|> empty
where
bothParse = ptxt >>= constP pa
The block quote markdown can be written in the desired fashion. This implementation requires the resulting block to be totally parsed.
blockQuoteMarkdown :: Parser [[Double]]
blockQuoteMarkdown = parseOf blockQuote ( markdownSurrogate <*
endOfInput
)
Instead of the actual markdown parser, I just implemented a quick parser of space separated doubles. The complication of the parser comes from allowing the last, non-empty line, either end in a new line or not.
markdownSurrogate :: Parser [[Double]]
markdownSurrogate = do
lns <- many (mdLine <* endOfLine)
option lns ((lns ++) . pure <$> mdLine1)
where
mdLine = sepBy double (satisfy (==' '))
mdLine1 = sepBy1 double (satisfy (==' '))
These two parsers are responsible for returning the text internal to block quotes.
blockQuote :: Parser Text
blockQuote = T.unlines <$> many blockLine
blockLine :: Parser Text
blockLine = char '>' *> takeTill isEndOfLine <* endOfLine
Finally, a test of the parser.
parseMain :: IO ()
parseMain = do
putStrLn ""
doParse "a" markdownSurrogate a
doParse "_" markdownSurrogate ""
doParse "b" markdownSurrogate b
doParse "ab" markdownSurrogate ab
doParse "a_b" markdownSurrogate a_b
doParse "badMarkdown x" markdownSurrogate x
doParse "badMarkdown axb" markdownSurrogate axb
putStrLn ""
doParse "BlockQuote ab" blockQuoteMarkdown $ toBlockQuote ab
doParse "BlockQuote a_b" blockQuoteMarkdown $ toBlockQuote a_b
doParse "BlockQuote axb" blockQuoteMarkdown $ toBlockQuote axb
where
a = "7 3 1"
b = "4 4 4"
x = "a b c"
ab = T.unlines [a,b]
a_b = T.unlines [a,"",b]
axb = T.unlines [a,x,b]
doParse desc p str = do
print $ T.concat ["Parsing ",desc,": \"",str,"\""]
let i = parse (p <* endOfInput ) str
print $ feed i ""
toBlockQuote = T.unlines
. map (T.cons '>')
. T.lines
*Main> parseMain
"Parsing a: \"7 3 1\""
Done "" [[7.0,3.0,1.0]]
"Parsing _: \"\""
Done "" []
"Parsing b: \"4 4 4\""
Done "" [[4.0,4.0,4.0]]
"Parsing ab: \"7 3 1\n4 4 4\n\""
Done "" [[7.0,3.0,1.0],[4.0,4.0,4.0]]
"Parsing a_b: \"7 3 1\n\n4 4 4\n\""
Done "" [[7.0,3.0,1.0],[],[4.0,4.0,4.0]]
"Parsing badMarkdown x: \"a b c\""
Fail "a b c" [] "endOfInput"
"Parsing badMarkdown axb: \"7 3 1\na b c\n4 4 4\n\""
Fail "a b c\n4 4 4\n" [] "endOfInput"
"Parsing BlockQuote ab: \">7 3 1\n>4 4 4\n\""
Done "" [[7.0,3.0,1.0],[4.0,4.0,4.0]]
"Parsing BlockQuote a_b: \">7 3 1\n>\n>4 4 4\n\""
Done "" [[7.0,3.0,1.0],[],[4.0,4.0,4.0]]
"Parsing BlockQuote axb: \">7 3 1\n>a b c\n>4 4 4\n\""
Fail ">7 3 1\n>a b c\n>4 4 4\n" [] "Failed reading: empty"
Discussion
The notable difference comes in the semantics of failure. For instance, when parsing axb
and blockquoted axb
, which are the following two strings, respectively
7 3 1
a b c
4 4 4
and
> 7 3 1
> a b c
> 4 4 4
the markdown parse results in
Fail "a b c\n4 4 4\n" [] "endOfInput"
whereas the quoted results in
Fail ">7 3 1\n>a b c\n>4 4 4\n" [] "Failed reading: empty"
The markdown consumes "7 3 1\n", but this is nowhere reported in the quoted failure. Instead, fail becomes all or nothing.
Likewise, there is no allowance for handling unparsed text in the case of partial success. But I don't see a need for this, given the use case. For example, if a parse looked something like the following
"{ <tok> unhandled }more to parse"
where {}
denotes the recognized block quote context, and <tok>
is parsed within that inner context. A partial success then would have to lift "unhandled" out of that block quote context and somehow combine it with "more to parse".
I see no general way to do this, but it is allowed through choice of inner parser return type. For instance, by some parser parseOf blockP innP :: Parser (<tok>,Maybe Text)
. However, if this need arises I would expect that there is a better way to handle the situation than with nested parsers.
There may also be concerns about the loss of attoparsec Partial parsing. That is, the implementation of constP
uses parseOnly
, which collapses the parse return Fail
and Partial
into a single Left
failure state. In other words, we lose the ability to feed our inner parser with more text as it becomes available. However, note that text to parse is itself the result of an outer parse; it will only be available after enough text has been fed to the outer parse. So this shouldn't be an issue either.
Parsec
is : it gives you the impression you don't need a lexer and you might not in theory. In your case, it seems that 'tokenizing' your file first will solve the problem. You just need to create on token for a BlockQuote. I might be wrong. – Lovelorn