How can I detect if a .NET StreamReader found a UTF8 BOM on the underlying stream?

I open a FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite) and then wrap it in a StreamReader(stream, true).

Is there a way I can check if the stream started with a UTF8 BOM? I am noticing that files without the BOM are read as UTF8 by the StreamReader.

How can I tell them apart?

Zaffer answered 16/2, 2011 at 3:15 Comment(0)

Rather than hardcoding the bytes, it is cleaner to get the preamble from the API:

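// Requires: using System; using System.Linq; using System.Text;
public string ConvertFromUtf8(byte[] bytes)
{
  var enc = new UTF8Encoding(true);
  var preamble = enc.GetPreamble();
  // Reject input that is too short or does not start with the UTF8 BOM (EF BB BF).
  if (bytes.Length < preamble.Length || !bytes.Take(preamble.Length).SequenceEqual(preamble))
    throw new ArgumentException("Not utf8-BOM");
  // Decode everything after the BOM without copying the array.
  return enc.GetString(bytes, preamble.Length, bytes.Length - preamble.Length);
}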
Casper answered 27/2, 2012 at 14:18 Comment(1)
@carlo-v-dango, I'd recommend adding some kind of null-check, since bytes may be empty if the file is empty: if (preamble.Where((p, i) => bytes.Length > i && p != bytes[i]).Any()), or whatever floats your boat. - Afterthought

You can detect whether the StreamReader encountered a BOM by initializing it with a BOM-less UTF8 encoding and checking to see if CurrentEncoding changes after the first read.

var utf8NoBom = new UTF8Encoding(false);
using (var reader = new StreamReader(file, utf8NoBom))
{
    reader.Read();
    if (Equals(reader.CurrentEncoding, utf8NoBom))
    {
        Console.WriteLine("No BOM");
    }
    else
    {
        Console.WriteLine("BOM detected");
    }
}
Stag answered 16/1, 2015 at 2:51 Comment(2)
I never would have thought that this would work. Thanks! It is really too bad that the opposite isn't true. You can't pass in UTF8Encoding(true) and have it return UTF8Encoding(false). - Paine
Nice! You can also use reader.Peek() instead of reader.Read(). - Bruyn
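The CurrentEncoding check above, combined with the Peek() suggestion, could be wrapped in a small helper along these lines. This is only a sketch: the BomProbe class and StartsWithBom method are hypothetical names, and the check reports any BOM that StreamReader recognizes (UTF-16 and UTF-32 included), not only the UTF8 one.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomProbe
{
    // Hypothetical helper: true if StreamReader switched away from the
    // BOM-less UTF8 encoding, which it does when it detects a BOM.
    public static bool StartsWithBom(Stream stream)
    {
        var utf8NoBom = new UTF8Encoding(false);
        // detectEncodingFromByteOrderMarks: true, leaveOpen: true so the
        // caller keeps ownership of the stream.
        using (var reader = new StreamReader(stream, utf8NoBom, true, 1024, true))
        {
            // Peek() fills the internal buffer, which runs BOM detection,
            // without consuming a character from the reader.
            reader.Peek();
            return !Equals(reader.CurrentEncoding, utf8NoBom);
        }
    }
}
```

Note that the reader buffers data, so the underlying stream's position will have advanced afterwards; rewind it (if it is seekable) before reading the content.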

Does this help? You can check the first three bytes of the file:

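    public static void Main(string[] args)
    {
        byte[] bits = new byte[3];
        int read;
        // Dispose the stream when done; only read access is needed.
        using (FileStream fs = new FileStream("spork.txt", FileMode.Open, FileAccess.Read))
        {
            read = fs.Read(bits, 0, 3);
        }

        // UTF8 byte order mark is: 0xEF,0xBB,0xBF
        // Files shorter than three bytes cannot contain it.
        if (read == 3 && bits[0] == 0xEF && bits[1] == 0xBB && bits[2] == 0xBF)
        {
            Console.WriteLine("BOM detected");
        }

        Console.ReadLine();
    }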
Alsoran answered 16/2, 2011 at 3:49 Comment(2)
Make sure to put the FileStream into a using statement as it is a disposable object. - Meaghanmeagher
Conventionally, it's better to use the preamble rather than hard-coded byte values. - Bayle
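Putting the two comments together, a byte-level check might look roughly like the following. It is a sketch under a few assumptions: the BomCheck class and HasUtf8Bom methods are made-up names, and the file is opened with the same FileShare.ReadWrite mode as in the question.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class BomCheck
{
    // Hypothetical helper: true if the stream's next bytes are the UTF8 BOM.
    public static bool HasUtf8Bom(Stream stream)
    {
        // Take the BOM bytes (EF BB BF) from the API instead of hardcoding them.
        byte[] preamble = new UTF8Encoding(true).GetPreamble();
        byte[] head = new byte[preamble.Length];
        int total = 0;
        // Stream.Read may return fewer bytes than requested, so loop
        // until the buffer is full or the stream ends.
        while (total < head.Length)
        {
            int n = stream.Read(head, total, head.Length - total);
            if (n == 0) break; // end of stream
            total += n;
        }
        return total == preamble.Length && head.SequenceEqual(preamble);
    }

    // Convenience overload for a file path; disposes the FileStream.
    public static bool HasUtf8Bom(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            return HasUtf8Bom(fs);
        }
    }
}
```

Calling BomCheck.HasUtf8Bom(filename) before constructing the StreamReader keeps the check separate from decoding, at the cost of opening the file twice.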
