Serializing a Data.Text value to a ByteString without unnecessary \NUL bytes
With the following code, I want to serialize a Data.Text value to a ByteString. Unfortunately my text is prepended with unnecessary NUL bytes and an EOT byte:

GHCi, version 9.4.4: https://www.haskell.org/ghc/  :? for help
ghci> import qualified Data.Text as T
ghci> import Data.Binary
ghci> import Data.Binary.Put
ghci> let txt = T.pack "Text"
ghci> runPut $ put txt
"\NUL\NUL\NUL\NUL\NUL\NUL\NUL\EOTText"
ghci>

Questions:

  • Why are these NUL and EOT bytes generated?
  • How can I avoid them in the resulting ByteString?

PS: In the real code I put the length in front of the text:

    foo :: Text -> ByteString
    foo txt = runPut $ do
        -- T.length returns an Int, but putWord32host expects a Word32
        putWord32host $ fromIntegral (T.length txt)
        put txt
Oreopithecus answered 19/1, 2024 at 20:18 Comment(3)
I think that is the length of the Text as well. – Incest
For the empty string it is eight NULs, so zero length; for one character, the last one is replaced with SOH, which means one, and so on. – Incest
@willeM_ Van Onsem Good observation! It uses 8 bytes for the length field; I need only 2 or 4 bytes. – Oreopithecus

The length is in fact already encoded in the output. If we look at the source code of the Binary instance for Text, we see [src]:

instance Binary Text where
    put t = put (encodeUtf8 t)
    get   = do
      bs <- get
      case decodeUtf8' bs of
        P.Left exn -> P.fail (P.show exn)
        P.Right a -> P.return a

That's not much of a surprise: the Text is encoded to UTF-8, which produces a ByteString, and put is then called on that ByteString. The length is added when the ByteString itself is put. Indeed, the ByteString instance of Binary looks like [src]:

instance Binary B.ByteString where
    put bs = put (B.length bs)
             <> putByteString bs
    get    = get >>= getByteString

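The put for the ByteString produced by encodeUtf8 thus first writes the length as an Int, which binary serializes as eight big-endian bytes. That explains the seven NUL bytes followed by EOT (the byte 4) in front of "Text". Note that this is the number of bytes in the UTF-8 encoding, which is not necessarily the same as the number of characters in the Text.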
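
To make the difference between byte count and character count concrete, here is a small sketch. The sample text is my own choice (it contains one non-ASCII character), and encode is used as shorthand for runPut . put:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.Binary (encode)
import Data.Text.Encoding (encodeUtf8)

main :: IO ()
main = do
  let txt = T.pack "h\233llo"          -- 5 characters; U+00E9 takes 2 bytes in UTF-8
      out = encode txt                 -- same bytes as runPut (put txt)
  print (T.length txt)                 -- 5: number of characters
  print (B.length (encodeUtf8 txt))    -- 6: number of UTF-8 bytes
  print (BL.unpack (BL.take 8 out))    -- [0,0,0,0,0,0,0,6]: the length prefix
```

The prefix holds 6, not 5, so it counts bytes rather than characters.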

If you want the same output without the length prefix, you can use:

import Data.Text.Encoding

runPut (putByteString (encodeUtf8 txt))

This omits the length header.
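
If you do want a length prefix, just a shorter one (the comments mention 2 or 4 bytes), you can write it yourself. Below is a minimal sketch, assuming a 4-byte big-endian prefix that counts UTF-8 bytes rather than characters. The names putText32 and getText32 are made up for this example and are not part of binary or text:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Binary.Get (Get, getByteString, getWord32be, runGet)
import Data.Binary.Put (Put, putByteString, putWord32be, runPut)
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- Hypothetical helper: 4-byte big-endian byte count, then the UTF-8 bytes.
putText32 :: T.Text -> Put
putText32 t = do
  let bs = encodeUtf8 t
  putWord32be (fromIntegral (B.length bs))
  putByteString bs

-- Hypothetical inverse: read the byte count, then decode that many bytes.
getText32 :: Get T.Text
getText32 = do
  n  <- getWord32be
  bs <- getByteString (fromIntegral n)
  either (fail . show) pure (decodeUtf8' bs)

main :: IO ()
main = do
  let out = runPut (putText32 (T.pack "Text"))
  print out                    -- "\NUL\NUL\NUL\EOTText"
  print (runGet getText32 out) -- "Text"
```

Writing the byte count instead of T.length keeps the reader in sync when the text contains non-ASCII characters. Replace putWord32be/getWord32be with putWord32host/getWord32host if you really need host byte order, as in the question.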

Incest answered 19/1, 2024 at 20:46 Comment(2)
putByteString has type putByteString :: Data.ByteString.Internal.ByteString -> Put. However, I need Data.Text -> ByteString. – Oreopithecus
@Jogger: yes, you use encodeUtf8 first to convert it to a ByteString, see edit. – Incest
