Parsing UTF-8 string of known length in Common Lisp one byte at a time

I'm writing a program in Common Lisp for editing binary files generated by Minecraft that use the NBT format, documented here: http://minecraft.gamepedia.com/NBT_format?cookieSetup=true (I'm aware that such tools already exist, such as NBTEditor and MCEdit, but neither is written in Common Lisp, and I thought this project would make an excellent learning exercise).

So far one of the only things that I haven't managed to implement on my own is a function for reading a UTF-8 string of known length that contains characters represented by more than one octet (i.e. non-ASCII characters). In the NBT format, every string is UTF-8 encoded and is preceded by a short (two-octet) integer n denoting the string's length. So, assuming that only ASCII characters are present in the string, I can simply read a sequence of n octets from the stream and convert it to a string like so:

(defun read-utf-8-string (string-length byte-stream)
  ;; Read STRING-LENGTH octets from BYTE-STREAM, then decode them as UTF-8.
  (let ((seq (make-array string-length :element-type '(unsigned-byte 8)
                                       :fill-pointer t)))
    ;; READ-SEQUENCE returns the number of octets actually read, so the
    ;; fill pointer is trimmed if the stream ends early.
    (setf (fill-pointer seq) (read-sequence seq byte-stream))
    (flexi-streams:octets-to-string seq :external-format :utf-8)))
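
For context, the two-octet length prefix mentioned above would be read first and then passed to this function. A minimal sketch, not part of the original question; the name read-nbt-string is hypothetical, and it assumes NBT's big-endian byte order for the prefix:

(defun read-nbt-string (byte-stream)
  ;; Read the unsigned, big-endian, two-octet length prefix, then the string.
  (let ((n (logior (ash (read-byte byte-stream) 8)
                   (read-byte byte-stream))))
    (read-utf-8-string n byte-stream)))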

But if one or more of the characters has a character code greater than 127, it is encoded in two or more octets, as seen in this example:

(flexi-streams:string-to-octets "wife" :external-format :utf-8)
==> #(119 105 102 101)

(flexi-streams:string-to-octets "жена" :external-format :utf-8)
==> #(208 182 208 181 208 189 208 176)

Both strings have the same length in characters, but each character of the Russian word is encoded in twice as many octets, so the total size of the encoded string is double that of the English one. So knowing the string's length in characters does not tell me how many octets to pass to read-sequence. Even if the size of the string (i.e. the number of octets needed to encode it) were known, there would still be no way of knowing which of those octets to convert to characters individually and which to group together for conversion.

So instead of rolling my own function, I tried to find a way to get either the implementation (Clozure CL) or an external library to do the work for me. Unfortunately this too has been problematic, because my parser relies on using the same file stream for all reading functions, like this:

(with-open-file (stream "test.dat" :direction :input
                                   :element-type '(unsigned-byte 8))
  ;; Read entire contents of NBT file from stream here
  )

which limits me to the :element-type '(unsigned-byte 8) and therefore prohibits me from specifying a character encoding and using read-char (or equivalent) like this:

(with-open-file (stream "test.dat" :external-format :utf-8)
  ...)

The :element-type has to be '(unsigned-byte 8) so that I can read and write integers and floats of various sizes. To avoid having to manually convert sequences of octets to strings, I first wondered whether there was a way to change the element type to a character type while the file is open, which led me to this discussion: https://groups.google.com/forum/#!searchin/comp.lang.lisp/binary$20write$20read/comp.lang.lisp/N0IESNPSPCU/Qmcvtk0HkC0J Apparently some CL implementations, such as SBCL, use bivalent streams by default, so both read-byte and read-char can be used on the same stream; if I were to take this approach, I would still need to be able to specify an :external-format (:utf-8) for the stream, although this format should only apply when reading characters, and not when reading raw bytes.

I have used a couple of functions from flexi-streams in the above examples for brevity's sake, but so far my code uses only the built-in stream types, and I have yet to use flexi-streams themselves. Is this problem a good candidate for flexi-streams? Having an additional layer of abstraction that would allow me to read raw bytes and UTF-8 characters interchangeably from the same stream would be ideal.
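
For what it's worth, a minimal sketch of what that layer might look like, assuming the flexi-streams library is loaded (e.g. via Quicklisp): flexi-streams:make-flexi-stream wraps a binary stream, accepts an :external-format, and the resulting stream supports both read-byte and read-char.

(with-open-file (byte-stream "test.dat" :direction :input
                                        :element-type '(unsigned-byte 8))
  (let ((stream (flexi-streams:make-flexi-stream byte-stream
                                                 :external-format :utf-8)))
    ;; Raw octets for integers, floats, etc.
    (read-byte stream)
    ;; UTF-8 decoded characters from the same underlying file.
    (read-char stream)))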

Any advice from those familiar with flexi-streams (or other relevant approaches) would be very much appreciated.

Thank you.

Extradition answered 8/8, 2015 at 19:7 Comment(2)
AIUI, the length prefix gives the length of the byte sequence, which you then have to decode. So your first function should be perfectly adequate.Kleist
As it turns out, you are correct! I had slightly misinterpreted the specification on that point. So where previously in my code I calculated the string-length value using (length mystring), I simply had to write (length (flexi-streams:string-to-octets mystring :external-format :utf-8)) instead. This saves having to calculate the number of octets in which each character must be encoded, as in Rainer's answer.Extradition
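
To illustrate the correction described in this comment (mystring is just a placeholder for the string being written), the two counts differ exactly when non-ASCII characters are present:

(let ((mystring "жена"))
  (values (length mystring)                            ; => 4 (characters)
          (length (flexi-streams:string-to-octets
                   mystring :external-format :utf-8)))) ; => 8 (octets, the value stored in the prefix)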

Here is something I wrote:

First we want to know how long the encoding for some character actually is, given the first byte.

(defun utf-8-number-of-bytes (first-byte)
  "returns the length of the utf-8 code in number of bytes, based on the first byte.
The length can be a number between 1 and 4."
  (declare (fixnum first-byte))
  (cond ((=       0 (ldb (byte 1 7) first-byte)) 1)
        ((=   #b110 (ldb (byte 3 5) first-byte)) 2)
        ((=  #b1110 (ldb (byte 4 4) first-byte)) 3)
        ((= #b11110 (ldb (byte 5 3) first-byte)) 4)
        (t (error "unknown number of utf-8 bytes for ~a" first-byte))))

Then we decode:

(defun utf-8-decode-unicode-character-code-from-stream (stream)
  "Decodes byte values, from a binary byte stream, which describe a character
encoded using UTF-8.
Returns the character code and the number of bytes read."
  (let* ((first-byte (read-byte stream))
         (number-of-bytes (utf-8-number-of-bytes first-byte)))
    (declare (fixnum first-byte number-of-bytes))
    (case number-of-bytes  ; CASE rather than ECASE, so the T fallback clause below is valid
      (1 (values (ldb (byte 7 0) first-byte)
                 1))
      (2 (values (logior (ash (ldb (byte 5 0) first-byte) 6)
                         (ldb (byte 6 0) (read-byte stream)))
                 2))
      (3 (values (logior (ash (ldb (byte 4 0) first-byte) 12)
                         (ash (ldb (byte 6 0) (read-byte stream)) 6)
                         (ldb (byte 6 0) (read-byte stream)))
                 3))
      (4 (values (logior (ash (ldb (byte 3 0) first-byte) 18)
                         (ash (ldb (byte 6 0) (read-byte stream)) 12)
                         (ash (ldb (byte 6 0) (read-byte stream)) 6)
                         (ldb (byte 6 0) (read-byte stream)))
                 4))
      (t (error "wrong UTF-8 encoding for file position ~a of stream ~s"
                (file-position stream)
                stream)))))
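
As a quick check of the decoder (this snippet is not part of the original answer), flexi-streams' with-input-from-sequence provides an in-memory binary stream; #(208 182) is the encoding of the first character of "жена" from the question:

(flexi-streams:with-input-from-sequence (s #(208 182))
  (code-char (utf-8-decode-unicode-character-code-from-stream s)))
;; => #\ж (code point 1078), with two octets consumed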

You know how many characters there are: N characters. So you can allocate a Unicode-capable string of N characters, call the function N times, and, for each result, convert the returned code to a character and put it into the string.
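
Putting those pieces together, a minimal sketch of that loop (the name read-utf-8-characters is hypothetical, and the count is taken to be a character count as described above):

(defun read-utf-8-characters (character-count stream)
  ;; Allocate the result string, then decode one character per iteration.
  (let ((string (make-string character-count)))
    (dotimes (i character-count string)
      (setf (char string i)
            (code-char
             (utf-8-decode-unicode-character-code-from-stream stream))))))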

Prader answered 8/8, 2015 at 20:56 Comment(1)
Very clever! I will look at this in detail tomorrow and see if I can incorporate it into a working solution. Thank you once again, Rainer, for your prompt and very helpful answer.Extradition
