How do you read UTF-8 characters from an infinite byte stream - C#
Asked Answered
O

4

6

Normally, to read characters from a byte stream you use a StreamReader. In this example I'm reading records delimited by '\r' from an infinite stream.

using(var reader = new StreamReader(stream, Encoding.UTF8))
{
    var messageBuilder = new StringBuilder();
    var nextChar = 'x';
    while (reader.Peek() >= 0)
    {
        nextChar = (char)reader.Read()
        messageBuilder.Append(nextChar);

        if (nextChar == '\r')
        {
            ProcessBuffer(messageBuilder.ToString());
            messageBuilder.Clear();
        }
    }
}

The problem is that the StreamReader has a small internal buffer, so if the code waiting for an 'end of record' delimiter ('\r' in this case) it has to wait until the StreamReader's internal buffer is flushed (usually because more bytes have arrived).

This alternative implementation works for single byte UTF-8 characters, but will fail on multibyte characters.

int byteAsInt = 0;
var messageBuilder = new StringBuilder();
while ((byteAsInt = stream.ReadByte()) != -1)
{
    var nextChar = Encoding.UTF8.GetChars(new[]{(byte) byteAsInt});
    Console.Write(nextChar[0]);
    messageBuilder.Append(nextChar);

    if (nextChar[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}

How can I modify this code so that it works with multi-byte characters?

Overthrust answered 26/7, 2012 at 14:42 Comment(5)
Shouldn't the title be modified to say multi-byte or UTF-16 characters instead of UTF-8? Seems misleading.Beige
@TimS. UTF-8 characters can be more than a single byte.Selvage
@TimS. what do you mean? A multibyte UTF-8 character does not automagically become an UTF-16 character. Wiki.Ferous
UTF-8 characters can be mult-byte en.wikipedia.org/wiki/Utf-8Overthrust
@MikeHadlow ah, thanks for the correction and info. I didn't realize that UTF-8 could include multi-byte characters.Beige
L
10

Rather than Encoding.UTF8.GetChars which is designed to convert complete buffers, get an instance of Decoder and repeatedly call its member method GetChars this will make use of the Decoder's internal buffer to handle partial multi-byte sequences from the end of one call to the next.

Lebar answered 26/7, 2012 at 14:48 Comment(1)
Thanks Richard, that works great. See my answer for my implementation.Overthrust
O
7

Thanks to Richard, I now have a working infinite stream reader. As he explained, the trick is to use a Decoder instance and call its GetChars method. I've tested it with multi-byte Japanese text and it works fine.

int byteAsInt = 0;
var messageBuilder = new StringBuilder();
var decoder = Encoding.UTF8.GetDecoder();
var nextChar = new char[1];

while ((byteAsInt = stream.ReadByte()) != -1)
{
    var charCount = decoder.GetChars(new[] {(byte) byteAsInt}, 0, 1, nextChar, 0);
    if(charCount == 0) continue;

    Console.Write(nextChar[0]);
    messageBuilder.Append(nextChar);

    if (nextChar[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}
Overthrust answered 26/7, 2012 at 15:7 Comment(0)
R
1

I don't understand why you're not using the stream reader's ReadLine method. If there's a good reason not to, however, it nonetheless seems to me that repeatedly calling GetChars on the decoder is inefficient. Why not make use of the fact that the byte representation of '\r' can't be part of a multi-byte sequence? (Bytes in a multi-byte sequence must be greater than 127; that is, they have the highest bit set.)

var messageBuilder = new List<byte>();

int byteAsInt;
while ((byteAsInt = stream.ReadByte()) != -1)
{
    messageBuilder.Add((byte)byteAsInt);

    if (byteAsInt == '\r')
    {
        var messageString = Encoding.UTF8.GetString(messageBuilder.ToArray());
        Console.Write(messageString);
        ProcessBuffer(messageString);
        messageBuilder.Clear();
    }
}
Recreant answered 26/7, 2012 at 22:54 Comment(1)
Wait, are you seriously saying calling GetChars on the decoder inefficient, while reading the stream byte-by-byte, putting it inside a List of bytes and then building a byte array out of that list and calling Encoding.GetString? Seems like you've missed the big performance issue for the small one :) ... oh, I see the OP did the same thing. Never mind.Vercelli
F
1

Mike, I found your solution perfect for my situation as well. But I noticed that sometimes it takes four GetChar() calls to determine the characters to be returned. This meant that charCount was 2, while my nextChar buffer size was 1. So I got error "The output character buffer is too small to contain the decoded characters, encoding Unicode fallback System.Text.DecoderReplacementFallback."

I changed my code to:

    // ...
    var nextChar = new char[4];  // 2 might suffice

    for (var i = startPos; i < bytesRead; i++)
    {
        int charCount;
        //...
        charCount = decoder.GetChars(buffer, i, 1, nextChar, 0);

        if (charCount == 0)
        {
            bytesSkipped++;
            continue;
        }

        for (int ic = 0; ic < charCount; ic++)
        {
            char c = nextChar[ic];
            charPos++;

            // Process character here...
        }
    }
Fixer answered 2/10, 2018 at 9:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.