Implementing Read typeclass where parsing strings includes "$"

Asked 15/9, 2011 at 23:26 Answered 15/9, 2011 at 23:57

I've been playing with Haskell for about a month. For my first "real" Haskell project I'm writing a parts-of-speech tagger. As part of this project I have a type called Tag that represents a parts-of-speech tag, implemented as follows:

data Tag = CC | CD | DT | EX | FW | IN | JJ | JJR | JJS ...

The above is a long list of standardized parts-of-speech tags which I've intentionally truncated. However, in this standard set of tags there are two that end in a dollar sign ($): PRP$ and NNP$. Because I can't have data constructors with $ in their name, I've elected to rename them PRPS and NNPS.

This is all well and good, but I'd like to read tags from strings in a lexicon and convert them to my Tag type. Trying this fails:

instance Read Tag where
    readsPrec _ input =
        (\inp -> [((NNPS), rest) | ("NNP$", rest) <- lex inp]) input

The Haskell lexer chokes on the $. Any ideas how to pull this off?
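For illustration, here is a minimal sketch of what lex does with such input. The name firstToken is hypothetical and only wraps the Prelude's lex so the behaviour can be inspected; the sample input is an assumption as well.

```haskell
-- lex is the Prelude's Haskell tokenizer. It treats '$' as an operator
-- character, so it is never part of an identifier token.
firstToken :: String -> [(String, String)]
firstToken = lex

-- firstToken "NNP$ dog" yields the token "NNP" with "$ dog" left over,
-- so a pattern like ("NNP$", rest) in the comprehension can never match.
```

In other words, the instance above does not crash; it simply never produces a parse for the tags that contain $.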

Implementing Show was fairly straightforward. It would be great if there were some similar strategy for Read.

instance Show Tag where
    showsPrec _ NNPS = showString "NNP$"
    showsPrec _ PRPS = showString "PRP$"
    showsPrec _ tag  = shows tag

Jailer asked 15/9, 2011 at 23:26 Comment(1)

Pretty much the only time you should be writing your own Show and Read instances, instead of using the instances that get derived automatically, is if your data type hides its internal representation (like Data.Set.Set and such, which spit out a fromList call) or works with literals, e.g. an instance of Num spitting out an integer literal it corresponds to. – Tasse 15/9, 2011 at 23:56

You're abusing Read here.

Show and Read are meant to print and parse valid Haskell values, to enable debugging, etc. This doesn't always work perfectly (e.g. if you import Data.Map qualified and then call show on a Map value, the call to fromList isn't qualified) but it's a valid starting point.
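As a small sketch of that intent, assuming a truncated version of the Tag type with derived instances (the helper roundTrips is a hypothetical name, not something from the question):

```haskell
-- A truncated, illustrative version of the tag type with derived instances.
data Tag = CC | CD | DT | NNPS | PRPS
    deriving (Show, Read, Eq)

-- show yields the constructor name as valid Haskell source,
-- and read parses that same text back to the original value.
roundTrips :: Tag -> Bool
roundTrips t = read (show t) == t
```

With derived instances, show PRPS gives "PRPS", which is exactly what you would type in source code, and that is the property a hand-written "PRP$" instance gives up.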

If you want to print or parse your values to match some specific format, then use a pretty-printing library for the former and an actual parsing library (e.g. uu-parsinglib, polyparse, parsec, etc.) for the latter. They typically have much nicer support for parsing than ReadS (though ReadP in GHC isn't too bad).

Whilst you may argue that this isn't necessary because this is just a quick'n'dirty hack you're doing, quick'n'dirty hacks have a tendency to linger around. Do yourself a favour and do it right the first time: it means there's less to re-write when you want to do it "properly" later on.
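A minimal sketch of keeping the external format apart from Show and Read could look like the following. It uses a plain lookup table rather than a parsing library, which is an assumption that only holds while each tag arrives as a separate word; tagTable, parseTag and renderTag are hypothetical names, and the constructor list is truncated.

```haskell
import Data.Tuple (swap)

-- Derived instances stay available for debugging.
data Tag = CC | CD | DT | NNPS | PRPS
    deriving (Show, Read, Eq)

-- External lexicon names paired with constructors (truncated, illustrative).
tagTable :: [(String, Tag)]
tagTable =
    [ ("CC", CC), ("CD", CD), ("DT", DT), ("NNP$", NNPS), ("PRP$", PRPS) ]

-- Look up a lexicon name; Nothing signals an unknown tag.
parseTag :: String -> Maybe Tag
parseTag name = lookup name tagTable

-- Render a tag in the lexicon's format, falling back to the constructor name.
renderTag :: Tag -> String
renderTag tag = maybe (show tag) id (lookup tag (map swap tagTable))
```

For a line that has already been split with words, something like map parseTag (words line) would then give one Maybe Tag per field, and a real parsing library can replace the table later without touching the instances.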

Snowy answered 15/9, 2011 at 23:43 Comment(2)

Thanks for your answer. And here I was thinking this was the proper way to do things, otherwise I wouldn't have bothered with a Read parser at all (the rows of the lexicon are nicely formatted and broken up using the standard words function)! Coming from OOP, I guess I'm still thinking of typeclasses as interfaces I must implement to get the behavior I need. – Jailer 15/9, 2011 at 23:52

Specifically, Read and Show are intended to be a matching set of poor man's serialization/deserialization to and from String, with the additional expectation that the serialized form, if cut and pasted into the original source file, would represent a value equivalent to the one show was applied to. – Tasse 15/9, 2011 at 23:52

Don't use the Haskell lexer then. GHC's Read functions are built on their own parser combinators (ReadP and ReadPrec), which are similar in style to Parsec; the Real World Haskell book has an excellent introduction to Parsec.

Here's some code that seems to work:

import Text.Read
import Text.ParserCombinators.ReadP hiding (choice)
import Text.ParserCombinators.ReadPrec hiding (choice)

data Tag = CC | CD | DT | EX | FW | IN | JJ | JJR | JJS deriving (Show)

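-- Turn each (text, value) pair into a parser that matches the text and returns the value.
strValMap :: [(String, a)] -> [ReadPrec a]
strValMap = map (\(x, y) -> lift $ string x >> return y)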

instance Read Tag where
    readPrec = choice $ strValMap [
        ("CC", CC),
        ("CD", CD),
        ("JJ$", JJS)
        ]

Just run it with

(read "JJ$") :: Tag

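The code is pretty self-explanatory. The string x parser matches the literal text x, and if it succeeds (on a mismatch the parser simply fails; no exception is thrown), then y is returned. We use choice to select among all of these. It tries every alternative, so if you add a CCC constructor, then CC matching only a prefix of "CCC" leaves input unconsumed and is discarded by read, and the CCC alternative is used instead. Of course, if you don't need this, then use the left-biased <++ combinator, which commits to the first alternative that succeeds; note that <|> on ReadPrec is the same symmetric choice as +++.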
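The CC versus CCC case can be sketched directly with ReadP from base. The names symmetric, leftBiased and fullParses are hypothetical, and the left-biased <++ operator is shown only for contrast; it is not part of the code above.

```haskell
import Text.ParserCombinators.ReadP (ReadP, readP_to_S, string, (+++), (<++))

-- Symmetric choice: both alternatives are tried on the same input.
symmetric :: ReadP String
symmetric = string "CC" +++ string "CCC"

-- Left-biased choice: the right side runs only if the left produces no result.
leftBiased :: ReadP String
leftBiased = string "CC" <++ string "CCC"

-- Keep only the parses that consumed the whole input, much as read does.
fullParses :: ReadP a -> String -> [a]
fullParses p input = [x | (x, "") <- readP_to_S p input]
```

Here fullParses symmetric "CCC" gives ["CCC"], because the CC result leaves a trailing "C" and is filtered out, while fullParses leftBiased "CCC" gives [] since the left alternative already succeeded and CCC was never tried.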

Woald answered 15/9, 2011 at 23:57 Comment(2)

Thanks. This is pretty much exactly what I was trying to accomplish. – Jailer 16/9, 2011 at 0:24

@gatoatigrado: actually, the Read functions do not use parsec: they have their own parser. – Snowy 16/9, 2011 at 1:9
