Convert Between Latin1-encoded Data.ByteString and Data.Text

Since the latin-1 (aka ISO-8859-1) character set is embedded in the Unicode character set as its lowest 256 code points, I'd expect the conversion to be trivial. However, I didn't see any latin-1 conversion functions in Data.Text.Encoding, which contains only conversion functions for the common UTF encodings.

What's the recommended and/or efficient way to convert between Data.ByteString values encoded in latin-1 representation and Data.Text values?

Triatomic answered 25/9, 2011 at 10:23 Comment(2)
By the way, the assumption that "since the latin-1 character set is embedded in the Unicode character set as its lowest 256 code-points, I'd expect the conversion to be trivial" is unwarranted. There is no reason to expect that the bytestreams resulting from encoding a single codepoint stream in two different encodings should have a trivial relationship to each other. - Phonemic
@DanielWagner: Yes, I'm aware that I shouldn't expect this in the general case (for instance, if Data.Text used UTF-8 as its internal Unicode representation). However, the current version of the Data.Text library uses a UTF-16 representation, for which the conversion from latin1 is in fact trivial: it consists of inserting a zero octet after or before each latin1 octet (depending on whether UTF-16LE or UTF-16BE is required). - Triatomic
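
The zero-octet widening described in the comment above can be sketched as follows. This is only an illustration of the idea, not how the text library does it internally; the names latin1ToUtf16LE and latin1ViaUtf16 are hypothetical, and the sketch goes through the public decodeUtf16LE function rather than touching Text's internal representation.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf16LE)

-- Hypothetical helper: widen each latin-1 octet to one UTF-16LE code unit
-- by appending a zero octet (low byte first, high byte second).
latin1ToUtf16LE :: B.ByteString -> B.ByteString
latin1ToUtf16LE = B.concatMap (\w -> B.pack [w, 0])

-- Hypothetical helper: decode latin-1 bytes by way of the widened UTF-16LE form.
latin1ViaUtf16 :: B.ByteString -> T.Text
latin1ViaUtf16 = decodeUtf16LE . latin1ToUtf16LE
```

For UTF-16BE the zero octet would go first instead, i.e. B.pack [0, w].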

The answer is right at the top of the page you linked:

To gain access to a much larger family of encodings, use the text-icu package: http://hackage.haskell.org/package/text-icu

A quick GHCi example:

λ> import Data.Text.ICU.Convert
λ> conv <- open "ISO-8859-1" Nothing
λ> Data.Text.IO.putStrLn $ toUnicode conv $ Data.ByteString.pack [198, 216, 197]
ÆØÅ
λ> Data.ByteString.unpack $ fromUnicode conv $ Data.Text.pack "ÆØÅ"
[198,216,197]

However, as you pointed out, in the specific case of latin-1, the code points coincide with Unicode, so you can use pack/unpack from Data.ByteString.Char8 to perform the trivial mapping between latin-1 bytes and String, which you can then convert to or from Text using the corresponding pack/unpack from Data.Text.
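
A minimal sketch of that String-based route is below. The function names latin1ToText and textToLatin1 are made up for illustration. One caveat is that Data.ByteString.Char8.pack silently truncates characters above code point 255 to 8 bits, so this sketch returns Nothing for text that does not fit in latin-1 rather than corrupting it.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T

-- Hypothetical helper: every latin-1 byte maps to the Char with the same
-- code point, so decoding cannot fail.
latin1ToText :: B.ByteString -> T.Text
latin1ToText = T.pack . BC.unpack

-- Hypothetical helper: encoding is only defined when every character is
-- at or below code point 255; otherwise BC.pack would truncate it.
textToLatin1 :: T.Text -> Maybe B.ByteString
textToLatin1 t
  | T.all (<= '\255') t = Just (BC.pack (T.unpack t))
  | otherwise           = Nothing
```

Going through String allocates an intermediate list, so this is simple rather than fast.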

Canonry answered 25/9, 2011 at 11:27 Comment(1)
Not being satisfied with the current options for converting from ByteString to Text, I finally coded up a direct conversion that performs near-optimally and doesn't expose the IO monad in its API; see github.com/bos/text/pull/18 - Triatomic
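
For readers on a more recent version of the text package: Data.Text.Encoding there exports a pure decodeLatin1 function, which covers the ByteString-to-Text direction directly. A short sketch, assuming a text version that includes it (the name decoded is just for illustration):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1)

-- Decode the same three latin-1 bytes used in the GHCi example above.
decoded :: T.Text
decoded = decodeLatin1 (B.pack [198, 216, 197])
```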
