Matching bytestrings in Parsec
Asked Answered
S

2

4

I am currently trying to use the Full CSV Parser presented in Real World Haskell. In order to I tried to modify the code to use ByteString instead of String, but there is a string combinator which just works with String.

Is there a Parsec combinator similar to string that works with ByteString, without having to do conversions back and forth?

I've seen there is an alternative parser that handles ByteString: attoparsec, but I would prefer to stick with Parsec, since I'm just learning how to use it.

Strader answered 16/3, 2013 at 21:49 Comment(5)
attoparsec is a lot faster and has an API very similar to Parsec, so I'd really recommend you to look into it!Boleyn
I highly recommend you use the cassava library to parse CSVs. It's very fast (it uses attoparsec internally) and it's very easy to use.Hyps
I assume you mean on the return side? You'll have to do your own packing afterwards, ie. pack $ string "foobar" (remembering to think about encoding.) Other than that, Parsecs string works just fine on ByteStrings on the input side.Divider
I'll certainly run some experiments using attoparsec and Cassava. However, my goal was to check how to use Parsec. CSV parsing is just a means to that end (for now). Thanks for sharing all this information!Strader
About the packing, I wanted to avoid it. I saw that Parsec handles ByteString input well, so yes: I meant the "return side".Strader
M
5

I'm assuming you're starting with something like

import Prelude hiding (getContents, putStrLn)
import Data.ByteString
import Text.Parsec.ByteString

Here's what I've got so far. There are two versions. Both compile. Probably neither is exactly what you want, but they should aid discussion and help you to clarify your question.

Something I noticed along the way:

  • If you import Text.Parsec.ByteString then this uses uncons from Data.ByteString.Char8, which in turn uses w2c from Data.ByteString.Internal, to convert all read bytes to Chars. This enables Parsec's line and column number error reporting to work sensibly, and also enables you to use string and friends without problem.

Thus, the easy version of the CSV parser, which does exactly that:

import Prelude hiding (getContents, putStrLn)
import Data.ByteString (ByteString)

import qualified Prelude (getContents, putStrLn)
import qualified Data.ByteString as ByteString (getContents)

import Text.Parsec
import Text.Parsec.ByteString

csvFile :: Parser [[String]]
csvFile = endBy line eol
line :: Parser [String]
line = sepBy cell (char ',')
cell :: Parser String
cell = quotedCell <|> many (noneOf ",\n\r")

quotedCell :: Parser String
quotedCell = 
    do _ <- char '"'
       content <- many quotedChar
       _ <- char '"' <?> "quote at end of cell"
       return content

quotedChar :: Parser Char
quotedChar =
        noneOf "\""
    <|> try (string "\"\"" >> return '"')

eol :: Parser String
eol =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"

parseCSV :: ByteString -> Either ParseError [[String]]
parseCSV = parse csvFile "(unknown)"

main :: IO ()
main =
    do c <- ByteString.getContents
       case parse csvFile "(stdin)" c of
            Left e -> do Prelude.putStrLn "Error parsing input:"
                         print e
            Right r -> mapM_ print r

But this was so trivial to get working that I assume it cannot possibly be what you want. Perhaps you want everything to remain a ByteString or [Word8] or something similar all the way through? Hence my second attempt below. I am still importing Text.Parsec.ByteString, which may be a mistake, and the code is hopelessly riddled with conversions.

But, it compiles and has complete type annotations, and therefore should make a sound starting point.

import Prelude hiding (getContents, putStrLn)
import Data.ByteString (ByteString)
import Control.Monad (liftM)

import qualified Prelude (getContents, putStrLn)
import qualified Data.ByteString as ByteString (pack, getContents)
import qualified Data.ByteString.Char8 as Char8 (pack)

import Data.Word (Word8)
import Data.ByteString.Internal (c2w)

import Text.Parsec ((<|>), (<?>), parse, try, endBy, sepBy, many)
import Text.Parsec.ByteString
import Text.Parsec.Prim (tokens, tokenPrim)
import Text.Parsec.Pos (updatePosChar, updatePosString)
import Text.Parsec.Error (ParseError)

csvFile :: Parser [[ByteString]]
csvFile = endBy line eol
line :: Parser [ByteString]
line = sepBy cell (char ',')
cell :: Parser ByteString
cell = quotedCell <|> liftM ByteString.pack (many (noneOf ",\n\r"))

quotedCell :: Parser ByteString
quotedCell = 
    do _ <- char '"'
       content <- many quotedChar
       _ <- char '"' <?> "quote at end of cell"
       return (ByteString.pack content)

quotedChar :: Parser Word8
quotedChar =
        noneOf "\""
    <|> try (string "\"\"" >> return (c2w '"'))

eol :: Parser ByteString
eol =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"

parseCSV :: ByteString -> Either ParseError [[ByteString]]
parseCSV = parse csvFile "(unknown)"

main :: IO ()
main =
    do c <- ByteString.getContents
       case parse csvFile "(stdin)" c of
            Left e -> do Prelude.putStrLn "Error parsing input:"
                         print e
            Right r -> mapM_ print r

-- replacements for some of the functions in the Parsec library

noneOf :: String -> Parser Word8
noneOf cs   = satisfy (\b -> b `notElem` [c2w c | c <- cs])

char :: Char -> Parser Word8
char c      = byte (c2w c)

byte :: Word8 -> Parser Word8
byte c      = satisfy (==c)  <?> show [c]

satisfy :: (Word8 -> Bool) -> Parser Word8
satisfy f   = tokenPrim (\c -> show [c])
                        (\pos c _cs -> updatePosChar pos c)
                        (\c -> if f (c2w c) then Just (c2w c) else Nothing)

string :: String -> Parser ByteString
string s    = liftM Char8.pack (tokens show updatePosString s)

Probably your concern, efficiency-wise, should be those two ByteString.pack instructions, in the definitions of cell and quotedCell. You might try to replace the Text.Parsec.ByteString module so that instead of “making strict ByteStrings an instance of Stream with Char token type,” you make ByteStrings an instance of Stream with Word8 token type, but this won't help you with efficiency, it will just give you a headache trying to reimplement all the sourcePos functions to keep track of your position in the input for error messages.

No, the way to make it more efficient would be to change the types of char, quotedChar and string to Parser [Word8] and the types of line and csvFile to Parser [[Word8]] and Parser [[[Word8]]] respectively. You could even change the type of eol to Parser (). The necessary changes would look something like this:

cell :: Parser [Word8]
cell = quotedCell <|> many (noneOf ",\n\r")

quotedCell :: Parser [Word8]
quotedCell = 
    do _ <- char '"'
       content <- many quotedChar
       _ <- char '"' <?> "quote at end of cell"
       return content

string :: String -> Parser [Word8]
string s    = [c2w c | c <- (tokens show updatePosString s)]

You don't need to worry about all the calls to c2w as far as efficiency is concerned, because they cost nothing.

If this doesn't answer your question, please say what would.

Mildew answered 22/3, 2013 at 17:0 Comment(5)
Wow! Your answer is very thorough. Indeed, that'll do.Strader
Beetle, your answer imports so many different ByteString modules, I'm very confused as to which ones are being used. Could you clarify it more?Disorganize
@tolgap: Firstly, you'll need to be looking at Chapter 16 of the book. All imports are necessary, so I'm using all of them. My goal here was just to create something that would compile, so that if the author got frustrated, he could look at the differences between his code and mine. The explanations are in the documentation for the various modules. I can't remember any specific details from when I wrote this answer two years ago.Mildew
@tolgap: There shouldn't be any ambiguity. For example, getContents is imported from Data.ByteString and hidden from Prelude. The two pack functions are imported as ByteString.pack and Char8.pack. In fact, the only time I haven't explicitly named what functions are imported is with Text.Parsec.ByteString, and from its documentation I can see that that's because it doesn't export anything except parseFromFile (which I'm not using) and the Parser type (which I am). What's confusing you?Mildew
Thank you beetle, the book and documentation links are very helpful.Disorganize
N
0

I don't believe so. You will need to create one yourself using tokens. Although the documentation for it is a bit... nonexistent, the first two arguments are a function to use to show the expected tokens in an error message and a function to update the source position that will be printed in errors.

Naoise answered 16/3, 2013 at 22:30 Comment(2)
Precisely, the tokens function isn't documented, which is why I didn't attempt to use it directly. I'll take a look into it some other time, though.Strader
@ArielD.MoyaSequeira I just told you what the documentation would say if it was there...Naoise

© 2022 - 2024 — McMap. All rights reserved.