Can anyone explain the pros and cons of using the Data.Text and Data.ByteString.Char8 data types? Does working with ASCII-only text change these pros and cons? Do their lazy variants change the story as well?
Data.ByteString.Char8 provides functions to treat ByteString values as sequences of 8-bit characters (ASCII, or more precisely Latin-1, since any Char above 255 is truncated), while Data.Text is an independent type supporting the entirety of Unicode.
ByteString and Text are essentially the same as far as representation goes: strict, unboxed arrays, with lazy variants based on lists of strict chunks. The main difference is that ByteString stores octets (i.e. Word8s), while Text stores Chars, encoded in UTF-16 in versions of the text package before 2.0 and in UTF-8 from text-2.0 onwards.
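To make the octets-versus-characters distinction concrete, here is a minimal sketch. The names greeting, greetingBytes and lazyBytes are made up for illustration; the library calls are from the text and bytestring packages.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- A Text value is a sequence of Unicode characters.
-- "\239" is U+00EF (i with diaeresis), so this is the word "naive" with an accent.
greeting :: T.Text
greeting = T.pack "na\239ve"

-- Encoding the Text as UTF-8 yields a ByteString, i.e. a sequence of octets.
greetingBytes :: B.ByteString
greetingBytes = TE.encodeUtf8 greeting

-- Elements of a Text are Chars; elements of a ByteString are Word8s.
chars :: [Char]
chars = T.unpack greeting

octets :: [Word8]
octets = B.unpack greetingBytes

-- A lazy ByteString is assembled from strict chunks.
lazyBytes :: BL.ByteString
lazyBytes = BL.fromChunks [greetingBytes, greetingBytes]

main :: IO ()
main = do
  print (T.length greeting)               -- 5 characters
  print (B.length greetingBytes)          -- 6 octets: U+00EF takes two bytes in UTF-8
  print (length (BL.toChunks lazyBytes))  -- 2 strict chunks
```

The character count and the byte count differ as soon as a single non-ASCII character appears, which is why the two types should not be treated as interchangeable.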
If you're working with ASCII-only text, then using Data.ByteString.Char8 will probably be faster than Text, and use less memory; however, you should ask yourself whether you're really sure that you're only ever going to work with ASCII. Basically, in 99% of cases, using Data.ByteString.Char8 over Text is a speed hack: octets aren't characters, and most Haskellers would agree that using the correct type should be prioritised over raw, bare-metal speed. You should usually only consider it if you've profiled the program and it's a bottleneck. Text is well-optimised, and the difference will probably be negligible in most cases.
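A small sketch of why "octets aren't characters" matters in practice: Data.ByteString.Char8.pack keeps only the low 8 bits of each Char, so anything outside Latin-1 is silently corrupted, whereas Data.Text.pack keeps the full code point. The names lambdaStr, asBytes and asText are illustrative only.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T

-- "\955" is U+03BB (Greek small letter lambda), which does not fit in one octet.
lambdaStr :: String
lambdaStr = "\955x"

-- Char8.pack truncates each Char to its low 8 bits.
asBytes :: BC.ByteString
asBytes = BC.pack lambdaStr

-- Text.pack preserves the code point.
asText :: T.Text
asText = T.pack lambdaStr

main :: IO ()
main = do
  print (B.unpack asBytes)                -- [187,120], since 955 mod 256 == 187
  print (BC.unpack asBytes == lambdaStr)  -- False: the round trip loses data
  print (T.unpack asText == lambdaStr)    -- True
```

No error is raised at any point, so this kind of data loss tends to go unnoticed until non-ASCII input shows up.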
Of course, there are non-speed-related situations in which Data.ByteString.Char8 is warranted. Consider a file containing data that is essentially binary, not text, but separated into lines; using lines is completely reasonable. Additionally, it's entirely conceivable that an integer might be encoded in ASCII decimal in the context of a binary format; using readInt would make perfect sense in that case.
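As a rough illustration of the "almost-binary" case, the sketch below assumes a hypothetical record format of one record per line, each consisting of an ASCII decimal id, a single space, and an opaque byte payload. The format and the names parseRecord, parseAll and sample are invented for this example; lines, readInt and drop are the real Data.ByteString.Char8 functions.

```haskell
import qualified Data.ByteString.Char8 as BC

-- Parse one line of the hypothetical format "<id> <payload>".
-- readInt consumes a leading decimal integer and returns the remainder.
parseRecord :: BC.ByteString -> Maybe (Int, BC.ByteString)
parseRecord line = do
  (n, rest) <- BC.readInt line
  -- Drop the single separator byte; the payload is kept as raw bytes.
  pure (n, BC.drop 1 rest)

-- Split the input on newline bytes and parse each line independently.
parseAll :: BC.ByteString -> [Maybe (Int, BC.ByteString)]
parseAll = map parseRecord . BC.lines

-- The second payload consists of the bytes 0xFF 0xFE, which are not valid text.
sample :: BC.ByteString
sample = BC.pack "12 abc\n7 \255\254\nbad line\n"

main :: IO ()
main = mapM_ print (parseAll sample)
```

The payload bytes never need to be valid text, which is exactly the situation where decoding to Text first would be the wrong move.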
So, basically:
Data.ByteString.Char8: For pure ASCII situations where performance is paramount, and to handle "almost-binary" data that has some ASCII components.
Data.Text: Text, including any situation where there's the slightest possibility of something other than ASCII being used.
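One way to apply this split, sketched under the assumption that the input is UTF-8 (the answer itself does not specify an encoding): keep raw input as ByteString and convert to Text at the boundary, handling decoding failure explicitly. bytesToText, validInput and invalidInput are hypothetical names; decodeUtf8' is from Data.Text.Encoding.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Decode raw bytes as UTF-8, returning an error message instead of throwing.
bytesToText :: B.ByteString -> Either String T.Text
bytesToText bs = case TE.decodeUtf8' bs of
  Left err -> Left (show err)
  Right t  -> Right t

-- The UTF-8 encoding of "caf" followed by U+00E9 (e with acute accent).
validInput :: B.ByteString
validInput = B.pack [0x63, 0x61, 0x66, 0xC3, 0xA9]

-- 0xFF can never appear in well-formed UTF-8.
invalidInput :: B.ByteString
invalidInput = B.pack [0xFF, 0xFE]

main :: IO ()
main = do
  print (fmap T.length (bytesToText validInput))  -- Right 4
  print (either (const "not text") (const "text") (bytesToText invalidInput))
```

Data that fails to decode is a hint that it belongs in the ByteString category rather than the Text one.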
Data.ByteString.Char8, then, as you'll essentially be dealing with a binary format that only resembles text. (I'd also recommend checking out attoparsec for parsing the files.) – Bloem
Text will encode each character as two bytes (this holds for the UTF-16 representation used before text-2.0), but ByteString will encode them as one. If you're currently using String, though, I wouldn't worry too much about it; String has huge overhead (5 words per character(!)), far more than the other two. See this summary of memory footprints. – Bloem
String benefits from sharing, while ByteString and Text, as unboxed arrays, don't; however, ByteString and Text both take substrings without copying, and they're just so much smaller to start with that you'd have to try pretty hard to make that disadvantage matter. –
Bloem