Haskell string tokenizer function

I needed a String tokenizer in Haskell, but there is apparently nothing already defined in the Prelude or other standard modules. There is splitOn in Data.Text, but that's a pain to use because you need to convert the String to Text and back.
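
For reference, going through Data.Text means packing and unpacking at the boundaries; roughly the sketch below (splitOnString is just a made-up wrapper name, not a library function):

import qualified Data.Text as T

-- Hypothetical wrapper: pack the String, split on the delimiter,
-- then unpack every resulting piece.
-- Note that T.splitOn errors on an empty delimiter.
splitOnString :: String -> String -> [String]
splitOnString delim = map T.unpack . T.splitOn (T.pack delim) . T.pack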

The tokenizer is not too hard to do, so I wrote one (it doesn't handle multiple adjacent delimiters, but it worked well for what I needed). I feel something like this should already be in the modules somewhere.

This is my version:

tokenizer :: Char -> String -> [String]
tokenizer delim str = tokHelper delim str []

tokHelper :: Char -> String -> [String] -> [String]
tokHelper d s acc
    | null pos  = reverse (pre:acc)
    | otherwise = tokHelper d (tail pos) (pre:acc)
        where (pre, pos) = span (/=d) s
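
For example, adjacent delimiters produce empty fields rather than being collapsed (a quick GHCi check of the definition above):

> tokenizer ',' "a,b,,c"
["a","b","","c"]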

I searched the internet for more solutions and found some discussions, like this blog post.

The last comment (by Mahee on June 10, 2011) is particularly interesting. Why not make a more generic version of the words function to handle this? I tried searching for such a function but found none.
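
Something along those lines would look roughly like the sketch below (wordsWhen is just an illustrative name, not a Prelude function); it collapses runs of delimiters the same way words collapses whitespace:

wordsWhen :: (Char -> Bool) -> String -> [String]
wordsWhen p s = case dropWhile p s of
    ""  -> []
    s'  -> w : wordsWhen p rest
        where (w, rest) = break p s'

-- wordsWhen (== 'x') "dogxxxcatxbirdxx"  ==  ["dog","cat","bird"]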

Is there a simpler way to do this, or is 'tokenizing' a string not a very recurring problem? :)

Tarrah answered 25/11, 2011 at 4:14 Comment(3)
I'm not a Haskell programmer, so take this with a grain of salt. However, I think the case you are describing is probably considered simple enough to implement, even though it's more complex than words. Most other parsing tasks beyond words-level are probably going to be complex enough to be worth doing with things like parser combinators (e.g. parsec) instead. – Honeyman
I would actually recommend for almost all code today that you use Text instead of String. It's a great library with much better performance. If you have more questions about this statement, I'd email the cafe. – Soapstone
You are absolutely right about this :) Thank you. I just had some problems using lazy IO a few moments ago (I tried for hours to make it work with getContents and getLine, reading on the net, etc.). But using the equivalent functions on Text, my problem was solved instantly. I don't know if the performance is worse if I use Text, but in the end there is no noticeable difference in my app. I'll have to research how Text works in more detail :) – Tarrah

The split library is what you need. Install it with cabal install split, and you have access to a lot of split/tokenizer-style functions.

Some examples from the library:

 > import Data.List.Split
 > splitOn "x" "axbxc"
 ["a","b","c"]
 > splitOn "x" "axbxcx"
 ["a","b","c",""]
 > endBy ";" "foo;bar;baz;"
 ["foo","bar","baz"]
 > splitWhen (<0) [1,3,-4,5,7,-9,0,2]
 [[1,3],[5,7],[0,2]]
 > splitOneOf ";.," "foo,bar;baz.glurk"
 ["foo","bar","baz","glurk"]
 > splitEvery 3 ['a'..'z']
 ["abc","def","ghi","jkl","mno","pqr","stu","vwx","yz"]

The wordsBy function from the same library is the generic version of words you were asking for:

wordsBy (=='x') "dogxxxcatxbirdxx" == ["dog","cat","bird"]
Polydactyl answered 25/11, 2011 at 4:29 Comment(0)

If you're parsing a Haskell-like language, you can use the lex function from the Prelude: http://hackage.haskell.org/packages/archive/base/latest/doc/html/Prelude.html#v:lex
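
For example, repeatedly applying lex gives a rough tokenizer for Haskell-style input (lexAll is my own helper name here, and this sketch simply stops on a failed parse):

-- Repeatedly apply Prelude's lex; lex "" yields [("","")], which ends the loop.
lexAll :: String -> [String]
lexAll s = case lex s of
    [("", _)]     -> []
    [(tok, rest)] -> tok : lexAll rest
    _             -> []   -- no parse (e.g. an unterminated string literal)

-- lexAll "f x = x + 1"  ==  ["f","x","=","x","+","1"]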

Hunfredo answered 26/11, 2011 at 5:22 Comment(0)
