anyBetween start end = start *> anyTill end
Your anyBetween parser eats its last character because anyTill does: it's designed to parse up to an end marker, on the assumption that you don't want to keep the closing brace in the input to parse again.
Notice that your end parsers are all single-character parsers, so we can change the functionality to make use of this:
anyBetween'' start ends = start *> many (satisfy (not.flip elem ends))
but many isn't as efficient as Attoparsec's takeWhile, which you should use as much as possible, so if you've done
import qualified Data.Attoparsec.Text as A
then
anyBetween' start ends = start *> A.takeWhile (not.flip elem ends)
should do the trick, and we can rewrite
styleWithoutQuotes = anyBetween' (stringCI "style=") [' ','>']
If you want it to eat the ' ' but not the '>', you can explicitly eat spaces afterwards:
styleWithoutQuotes = anyBetween' (stringCI "style=") [' ','>']
    <* A.takeWhile isSpace
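To see what is left in the input after the style has been parsed, here is a minimal sketch that can be loaded into GHCi. It assumes Text input and OverloadedStrings, uses asciiCI in place of stringCI, and adds a hypothetical helper withRest (not part of Attoparsec) that pairs the result with the unconsumed input.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Attoparsec.Text (Parser, asciiCI, parseOnly, takeText)
import qualified Data.Attoparsec.Text as A
import           Data.Char            (isSpace)
import           Data.Text            (Text)

-- Run the start parser, then take everything up to (but not including)
-- the first character listed in ends.
anyBetween' :: Parser a -> [Char] -> Parser Text
anyBetween' start ends = start *> A.takeWhile (not . flip elem ends)

-- Unquoted style value: stop at ' ' or '>', then eat any trailing spaces.
styleWithoutQuotes :: Parser Text
styleWithoutQuotes =
  anyBetween' (asciiCI "style=") [' ', '>'] <* A.takeWhile isSpace

-- Hypothetical helper: return the parsed value together with the rest of the input.
withRest :: Parser a -> Text -> Either String (a, Text)
withRest p = parseOnly ((,) <$> p <*> takeText)
```

For example, withRest styleWithoutQuotes "style=red >x" should give Right ("red", ">x"): the space is consumed, the '>' is not.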
Going for more takeWhile
Perhaps styleWithQuotes could do with a rewrite to use takeWhile as well, so let's make two helpers along the lines of anyBetween. They take from a starting parser up to an ending character, and there are inclusive and exclusive versions:
fromUptoExcl startP endChars = startP *> takeTill (flip elem endChars)
fromUptoIncl startP endChars = startP *> takeTill (flip elem endChars) <* anyChar
But I think, from what you said, you want styleWithoutQuotes to be a hybrid; it eats ' ' but not >:
fromUptoEat startP endChars eatChars =
    startP
    *> takeTill (flip elem endChars)
    -- skipWhile also succeeds when no eatChars follow, e.g. right before '>'
    <* skipWhile (flip elem eatChars)
(All of these assume a small number of characters in your end-character lists; otherwise elem isn't efficient. There are some Set variants if you're checking against a big list like an alphabet.)
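One way to test against a larger class of characters is Attoparsec's inClass, which accepts range notation. Whether this is the variant meant above is an assumption on my part; the sketch below only shows the shape, and upToAlphaNum is a made-up name.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, inClass, parseOnly, takeTill)
import Data.Text            (Text)

-- Take input up to (but not including) the first ASCII letter or digit.
-- inClass takes a String describing the class, with ranges such as a-z.
upToAlphaNum :: Parser Text
upToAlphaNum = takeTill (inClass "a-zA-Z0-9")
```

Here parseOnly upToAlphaNum "--- x" should return Right "--- ".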
Now for the rewrite:
styleWithQuotes' = fromUptoIncl (stringCI "style=\"") "\""
styleWithoutQuotes' = fromUptoEat (stringCI "style=") " >" " "
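The difference between the exclusive and inclusive helpers shows up in what remains of the input. A small sketch, assuming Text input, OverloadedStrings and asciiCI; withRest, quotedExcl and quotedIncl are hypothetical names used only for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, anyChar, asciiCI, parseOnly, takeText, takeTill)
import Data.Text            (Text)

fromUptoExcl, fromUptoIncl :: Parser a -> [Char] -> Parser Text
-- Stop in front of the end character and leave it in the input.
fromUptoExcl startP endChars = startP *> takeTill (flip elem endChars)
-- Stop in front of the end character, then consume it as well.
fromUptoIncl startP endChars = startP *> takeTill (flip elem endChars) <* anyChar

-- Hypothetical helper: pair the result with whatever input is left over.
withRest :: Parser a -> Text -> Either String (a, Text)
withRest p = parseOnly ((,) <$> p <*> takeText)

quotedExcl, quotedIncl :: Parser Text
quotedExcl = fromUptoExcl (asciiCI "style=\"") "\""
quotedIncl = fromUptoIncl (asciiCI "style=\"") "\""
```

On the input style="a:b" id, quotedExcl leaves the closing quote behind (rest is "\" id"), while quotedIncl consumes it (rest is " id").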
The overall parser everythingButStyles uses <|> in a way that means that if it doesn't find "style" it will backtrack and then take everything. This is an example of the sort of thing that can be slow. The problem is that we fail late, at the end of the input string, which is a bad time to decide whether we should fail. Let's go all out and try to:
- Fail straight away if we're going to fail.
- Maximise use of the faster parsers from Data.Attoparsec.Text.Internal
Idea: take until we get an s, then skip the style if there's one there.
notStyleNotEvenS = takeTill (flip elem "sS")
skipAnyStyle = (styleWithQuotes' <|> styleWithoutQuotes') *> notStyleNotEvenS
           <|> cons <$> anyChar <*> notStyleNotEvenS
The anyChar is usually an s or S, but there's no sense checking that again.
-- many yields a list of Text chunks, so join them before appending
noStyles = append <$> notStyleNotEvenS <*> (mconcat <$> many skipAnyStyle)
parseNoStyles = parseOnly noStyles
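Putting the pieces together, here is a self-contained sketch of the whole parser that can be loaded into GHCi. It assumes Text input and OverloadedStrings, uses asciiCI rather than stringCI, consumes the trailing characters with skipWhile so the unquoted parser also succeeds when the value is followed directly by >, and joins the chunks produced by many with T.concat. Treat it as one possible arrangement rather than a definitive version.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Control.Applicative  (many, (<|>))
import           Data.Attoparsec.Text (Parser, anyChar, asciiCI, parseOnly,
                                       skipWhile, takeTill)
import           Data.Text            (Text)
import qualified Data.Text            as T

-- Take up to an end character and consume that end character too.
fromUptoIncl :: Parser a -> [Char] -> Parser Text
fromUptoIncl startP endChars =
  startP *> takeTill (flip elem endChars) <* anyChar

-- Take up to an end character, then eat any run of eatChars (possibly none).
fromUptoEat :: Parser a -> [Char] -> [Char] -> Parser Text
fromUptoEat startP endChars eatChars =
  startP *> takeTill (flip elem endChars) <* skipWhile (flip elem eatChars)

styleWithQuotes', styleWithoutQuotes' :: Parser Text
styleWithQuotes'    = fromUptoIncl (asciiCI "style=\"") "\""
styleWithoutQuotes' = fromUptoEat  (asciiCI "style=") " >" " "

-- Everything up to the next s or S; never fails, may return empty text.
notStyleNotEvenS :: Parser Text
notStyleNotEvenS = takeTill (flip elem ("sS" :: String))

-- Either drop a style attribute, or keep one character and carry on.
skipAnyStyle :: Parser Text
skipAnyStyle =
      (styleWithQuotes' <|> styleWithoutQuotes') *> notStyleNotEvenS
  <|> T.cons <$> anyChar <*> notStyleNotEvenS

noStyles :: Parser Text
noStyles = T.append <$> notStyleNotEvenS <*> (T.concat <$> many skipAnyStyle)

parseNoStyles :: Text -> Either String Text
parseNoStyles = parseOnly noStyles
```

With these definitions, parseNoStyles "<b STYLE=x>t</b>" should give Right "<b >t</b>". The quoted variant does not eat a space after the closing quote, so "<p style=\"color:red\" class=a>hi</p>" comes back with two spaces between p and class.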
ByteString or Text) you're using is unclear, but in case you're using Text, you should use asciiCI rather than stringCI; the latter is now deprecated. – Elbrus