Why do library designers use ByteString where Text seems to be appropriate?

While working on my app, I ran into a problem with Aeson not decoding UTF-8 input. Digging deeper, I found that it relies on Attoparsec's Parser ByteString, which seems to me to be the source of the problem. But that is not actually what I'm asking about here.

The thing is, this is not the only place I've seen people use ByteString where, as seems obvious to me, only Text is appropriate: JSON is not some binary file, it is readable text, and it may very well contain UTF-8 characters.

So I am wondering whether I'm missing something and there are valid reasons to choose ByteString over Text, or whether this is simply a widespread case of bad library design, caused by most people caring little about any character set other than Latin.

Mauritamauritania answered 29/12, 2012 at 10:13 Comment(3)
Note that ByteString precedes Text by quite a few years. No doubt there are a good number of libraries that chose to use ByteString when Text wasn't an option, so it is mistaken to cite them as "bad library design".Hidalgo
@stephentetley I don't understand what you found offensive about those words, or the downvote. Anyway, I wasn't trying to criticize, just trying to clear things up. Your remark on the probable historical reasons is helpful.Mauritamauritania
When using Text in API design, you have to be certain you can always rely on the input being UTF-8 encoded. I can't tell you how many times I've made that assumption, feeling safe and sound, only to have my program crash down the line on valid input (semantically valid in the problem domain) that happened to use some other encoding. If your interface is in Text, you no longer have any control over encoding inside your program. I've found that to be needlessly restrictive in most designs (although admittedly not ALL of them).Garnishee

I think your problem is just a misunderstanding.

Prelude> print "Ёжик лижет мёд."
"\1025\1078\1080\1082 \1083\1080\1078\1077\1090 \1084\1105\1076."
Prelude> putStrLn "\1025\1078\1080\1082 \1083\1080\1078\1077\1090 \1084\1105\1076."
Ёжик лижет мёд.
Prelude> "{\"a\": \"Ёжик лижет мёд.\"}"
"{\"a\": \"\1025\1078\1080\1082 \1083\1080\1078\1077\1090 \1084\1105\1076.\"}"

When you print a value containing a String, the Show instance for Char is used, and that escapes all characters with code points above 127. To get the glyphs you want, you need to putStr[Ln] the String.
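The same difference can be reproduced outside GHCi. This is only a minimal sketch: the names `sample` and `escaped` are made up for illustration, and setting the encoding of `stdout` to UTF-8 is my assumption about the terminal.

```haskell
import System.IO (hSetEncoding, stdout, utf8)

-- The first word of the example sentence, written with escapes so the
-- source file itself stays pure ASCII.
sample :: String
sample = "\1025\1078\1080\1082"

-- What `print sample` would write: `show` escapes the non-ASCII
-- characters and wraps the result in double quotes.
escaped :: String
escaped = show sample

main :: IO ()
main = do
  hSetEncoding stdout utf8  -- assumption: the terminal expects UTF-8
  putStrLn escaped          -- the escaped form, as `print` shows it
  putStrLn sample           -- the Cyrillic glyphs themselves
```

The String is the same in both cases; only the way it is rendered differs.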

So aeson properly decoded the utf8-encoded input, as should be expected because it utf8-encodes the values itself:

encode = {-# SCC "encode" #-} encodeUtf8 . toLazyText . fromValue .
         {-# SCC "toJSON" #-} toJSON
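To check this yourself, something along these lines should do. It is a sketch that assumes the aeson, text and bytestring packages are installed; `input`, `decoded` and `reDecoded` are names I made up, and the only aeson functions used are `decode` and `encode`.

```haskell
import Data.Aeson (Value, decode, encode)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Encoding (encodeUtf8)

-- A small JSON document containing Cyrillic text, as UTF-8 bytes.
input :: BL.ByteString
input = encodeUtf8 (TL.pack "{\"a\": \"\1025\1078\1080\1082\"}")

-- Decoding UTF-8 input yields a Value (Nothing would mean a parse failure).
decoded :: Maybe Value
decoded = decode input

-- Encoding the Value again and decoding the result gives back the same Value.
reDecoded :: Maybe Value
reDecoded = decoded >>= decode . encode

main :: IO ()
main = do
  print decoded                 -- non-ASCII characters appear escaped, because of Show
  print (reDecoded == decoded)
```

Any escapes in the output of `print decoded` come from the Show instance, as above, and not from a decoding failure.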

So, on to the question of why aeson uses ByteString and not Text as the final target of encoding and the starting point of decoding.

Because that is the appropriate type. The encoded values are intended to be transferred portably between machines. That happens as a stream of bytes (octets, if we're in a pedantic mood). That is exactly what a ByteString provides: a sequence of bytes that then has to be interpreted in an application-specific way. For the purposes of aeson, the stream of bytes shall be encoded in UTF-8; aeson assumes the input of the decode function is valid UTF-8, and encodes its output as valid UTF-8.
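To make the bytes-versus-text boundary concrete, here is a small sketch using `encodeUtf8` and `decodeUtf8'` from Data.Text.Encoding. It assumes the text and bytestring packages; the names `wire`, `roundTrip` and `truncatedIsRejected` are purely illustrative.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- Text going out on the wire becomes a sequence of bytes:
-- two Cyrillic characters turn into four UTF-8 octets.
wire :: B.ByteString
wire = encodeUtf8 (T.pack "\1025\1078")

-- Bytes coming in only become Text after an explicit decoding step,
-- which can fail; the error is rendered as a String here.
roundTrip :: Either String T.Text
roundTrip = either (Left . show) Right (decodeUtf8' wire)

-- A lone lead byte of a two-byte sequence is not valid UTF-8.
truncatedIsRejected :: Bool
truncatedIsRejected = either (const True) (const False) (decodeUtf8' (B.pack [0xC3]))

main :: IO ()
main = do
  print (B.unpack wire)
  print roundTrip
  print truncatedIsRejected
```

Whoever holds a ByteString still has to decide how to interpret it, which is the application-specific step mentioned above.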

Transferring e.g. Text directly would run into portability problems, since a 16-bit encoding, such as the UTF-16 that Text used internally at the time, depends on endianness. So Text is not an appropriate format for interchanging data between machines. Note that aeson uses Text as an intermediate type when encoding (and presumably also when decoding), because that is an appropriate type to use at intermediate stages.
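The endianness point can be seen by encoding a single character in different ways. This is a sketch assuming the text and bytestring packages; the name `yo` and the three byte lists are illustrative only.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf16BE, encodeUtf16LE, encodeUtf8)
import Data.Word (Word8)

-- A single character, U+0401 (CYRILLIC CAPITAL LETTER IO).
yo :: T.Text
yo = T.pack "\1025"

utf16le, utf16be, utf8Bytes :: [Word8]
utf16le   = B.unpack (encodeUtf16LE yo)  -- [0x01, 0x04]
utf16be   = B.unpack (encodeUtf16BE yo)  -- [0x04, 0x01]
utf8Bytes = B.unpack (encodeUtf8 yo)     -- [0xD0, 0x81]

main :: IO ()
main = do
  print utf16le    -- byte order differs between the two UTF-16 variants
  print utf16be
  print utf8Bytes  -- UTF-8 has a single, fixed byte order
```

The two UTF-16 results contain the same octets in opposite order, so sender and receiver would have to agree on the byte order out of band; UTF-8 needs no such agreement.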

Convey answered 29/12, 2012 at 17:21 Comment(2)
This makes a whole lot of sense. Thank you! So it looks like I raised a false alarm on the issue tracker.Mauritamauritania
I don't see why it's a parser/emitter's responsibility to pick a format for /my/ network transfers. It is parsing textual data, and Data.Text has functions for producing and consuming any of the UTF encodings, whereas Aeson is limited to parsing UTF-8 encoded bytestrings.Fishbowl
