StreamWriter and UTF-8 Byte Order Marks
H

11

79

I'm having an issue with StreamWriter and byte order marks. The documentation seems to state that the Encoding.UTF8 encoding has byte order marks enabled, but when files are written, some have the marks while others don't.

I'm creating the stream writer in the following way:

this.Writer = new StreamWriter(this.Stream, System.Text.Encoding.UTF8);

Any ideas on what could be happening would be appreciated.

Hendeca answered 10/3, 2011 at 21:21 Comment(7)
Note that, while technically allowed in UTF-8, a BOM is neither required nor recommended by Unicode (see ref). For one thing, it's useless (unlike for, say, UTF-16): the UTF-8 byte order is specified by the standard. For another, it can screw up text processing. For instance, many XML parsers will choke if there are any characters before the XML prolog.Musketry
Are you sure that you are specifying UTF8? Because if you don't specify it, it will still write UTF-8, but without the BOM.Fourdrinier
From The Unicode Standard 5.0: The Unicode Standard also specifies the use of an initial byte order mark (BOM) to explicitly differentiate big-endian or little endian data in some of the Unicode encoding schemes.Antihero
Have you resolved this issue? If so, please, mark the correct answer or post your own to help others.Hawley
@Kevin: You did not clarify why you were getting inconsistent results.Flick
Possible duplicate of Create Text File Without BOMSuit
If you don't specify BOM then Excel opens CSV files as if it was UTF7.. Thanks MS Excel ☠ 😢 Everything else is happy to open it as UTF8 like a normal person.Palpebrate
A
133

As someone already pointed out, calling the constructor without the encoding argument does the trick. However, if you want to be explicit, try this:

using (var sw = new StreamWriter(this.Stream, new UTF8Encoding(false)))

To disable the BOM, the key is to construct the writer with a new UTF8Encoding(false) instead of just Encoding.UTF8. This is the same as calling StreamWriter without the encoding argument; internally it does the same thing.

To enable BOM, use new UTF8Encoding(true) instead.

Update: Since Windows 10 v1903, notepad.exe no longer adds a BOM by default when saving as UTF-8; the BOM is now opt-in.

Amanda answered 25/7, 2012 at 17:14 Comment(7)
I don't get it - how can an answer that results in a C# syntax error get 64 up-votes over six years, and nobody mentions that it results in a syntax error?Warms
Haha I guess that's left as an exercise for the reader :P I think I have fixed the error.Amanda
An alternative fix could be using (var sw = new StreamWriter("text.txt", false, new UTF8Encoding(false))).Warms
new UTF8Encoding(false) is the important bit. You don't really believe that people just copy paste things off SO, do you?Riplex
Why false? It should be true. Please check the answer of Nik below. I didn't get it, how can this answer get the top up votes, as it provided the opposite answer.Brower
@ChuckLu I came here searching for a way to remove a BOM, on my platform (linux .net5) emitting it seems to be the default. But good point. Upvoted Nik's answer.Riplex
@ChuckLu it has the most upvotes because the majority of people want this symbol removed and didn't really read the question lolMasque
L
22

My answer is based on HelloSam's one which contains all the necessary information. Only I believe what OP is asking for is how to make sure that BOM is emitted into the file.

So instead of passing false to UTF8Encoding ctor you need to pass true.

    using (var sw = new StreamWriter("text.txt", false, new UTF8Encoding(true)))

Try the code below, open the resulting files in a hex editor and see which one contains BOM and which doesn't.

using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        // UTF8Encoding(false): no BOM, the file starts directly with the text.
        const string nobomtxt = "nobom.txt";
        File.Delete(nobomtxt);

        using (Stream stream = File.OpenWrite(nobomtxt))
        using (var writer = new StreamWriter(stream, new UTF8Encoding(false)))
        {
            writer.WriteLine("HelloПривет");
        }

        // UTF8Encoding(true): the file starts with the BOM bytes EF BB BF.
        const string bomtxt = "bom.txt";
        File.Delete(bomtxt);

        using (Stream stream = File.OpenWrite(bomtxt))
        using (var writer = new StreamWriter(stream, new UTF8Encoding(true)))
        {
            writer.WriteLine("HelloПривет");
        }
    }
}
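If you'd rather not reach for a hex editor, the check can also be done in code. This is only a minimal sketch: BomCheck and StartsWithUtf8Bom are made-up names for illustration, not part of the framework.

```csharp
using System;
using System.IO;
using System.Text;

static class BomCheck
{
    // Hypothetical helper: returns true if the file begins with the
    // UTF-8 BOM bytes (EF BB BF), false otherwise.
    public static bool StartsWithUtf8Bom(string path)
    {
        byte[] bom = new UTF8Encoding(true).GetPreamble();
        byte[] head = new byte[bom.Length];

        using (FileStream fs = File.OpenRead(path))
        {
            // Read may return fewer bytes than requested, so loop until
            // the buffer is full or the end of the file is reached.
            int total = 0;
            while (total < head.Length)
            {
                int read = fs.Read(head, total, head.Length - total);
                if (read == 0) return false; // file is shorter than a BOM
                total += read;
            }
        }

        for (int i = 0; i < bom.Length; i++)
        {
            if (head[i] != bom[i]) return false;
        }
        return true;
    }
}
```

After running the program above, BomCheck.StartsWithUtf8Bom("bom.txt") should return true and BomCheck.StartsWithUtf8Bom("nobom.txt") should return false.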
Library answered 19/3, 2014 at 19:31 Comment(0)
W
21

The issue is due to the fact that you are using the static UTF8 property on the Encoding class.

When the GetPreamble method is called on the instance of the Encoding class returned by the UTF8 property, it returns the byte order mark (a byte array of three bytes), which is then written to the stream before any other content (assuming a new stream).

You can avoid this by creating the instance of the UTF8Encoding class yourself, like so:

// As before.
this.Writer = new StreamWriter(this.Stream, 
    // Create the encoding yourself; the parameterless constructor does not emit a BOM.
    new System.Text.UTF8Encoding());

As per the documentation for the default parameterless constructor (emphasis mine):

This constructor creates an instance that does not provide a Unicode byte order mark and does not throw an exception when an invalid encoding is detected.

This means that the call to GetPreamble will return an empty array, and therefore no BOM will be written to the underlying stream.
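To see the difference for yourself, you can print the preamble of each encoding instance. A small sketch, where PreambleDemo and Describe are illustrative names of my own:

```csharp
using System;
using System.Text;

static class PreambleDemo
{
    // Hypothetical helper: formats the preamble (BOM) an encoding would
    // hand to StreamWriter, or a marker string if the preamble is empty.
    public static string Describe(Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length == 0
            ? "(no BOM)"
            : BitConverter.ToString(preamble);
    }
}
```

PreambleDemo.Describe(Encoding.UTF8) and PreambleDemo.Describe(new UTF8Encoding(true)) both give "EF-BB-BF", while PreambleDemo.Describe(new UTF8Encoding()) gives "(no BOM)".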

Wellappointed answered 23/3, 2013 at 5:27 Comment(4)
The encoding is a user setting in our program (which sends text messages over TCP)... it's retrieved with a simple parse with enc = Encoding.GetEncoding(...). The only way around I found was to actually add if (enc is UTF8Encoding) enc = new UTF8Encoding(false); behind it. A pretty dirty fix though, but I see no other way to solve it...Derogate
@Derogate That's not the only way. You can abstract the obtaining of the encoding into an interface that given a parameter, gets the encoding. Then you pass/inject an implementation of that interface into your code. It then makes everything quite clean.Wellappointed
That kinda just moves the same thing to a different class. Overall, I just find it utterly bizarre that the GetEncoding somehow manages not to use the default constructor. Ah, well.Derogate
To elaborate, GetPreamble is called internally by StreamWriter (see in the source), hence when called upon UTF8 property (which is constructed internally with UTF8Encoding(true)), it returns BOM as explained in the answer (see also the remarks section).Monochromatic
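For the case discussed in these comments, where the encoding comes from a user setting, the workaround can be wrapped in one place. This is a rough sketch based on the `enc is UTF8Encoding` check mentioned above; EncodingHelper and WithoutUtf8Bom are hypothetical names, and note that the replacement instance does not carry over any custom fallback or error-handling settings of the original encoding.

```csharp
using System;
using System.Text;

static class EncodingHelper
{
    // Hypothetical helper: if the given encoding is UTF-8, return a UTF-8
    // encoding that does not emit a BOM; otherwise return it unchanged.
    public static Encoding WithoutUtf8Bom(Encoding encoding)
    {
        if (encoding is UTF8Encoding)
        {
            return new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
        }
        return encoding;
    }
}
```

Usage would look something like `Encoding enc = EncodingHelper.WithoutUtf8Bom(Encoding.GetEncoding(settingValue));`, where settingValue stands for whatever name the user configured.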
V
18

The only time I've seen that constructor not add the UTF-8 BOM is if the stream is not at position 0 when you call it. For example, in the code below, the BOM isn't written:

using (var s = File.Create("test2.txt"))
{
    s.WriteByte(32);
    using (var sw = new StreamWriter(s, Encoding.UTF8))
    {
        sw.WriteLine("hello, world");
    }
}

As others have said, if you're using the StreamWriter(stream) constructor, without specifying the encoding, then you won't see the BOM.
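Here is a way to check the position behavior without touching the file system. It's just a sketch using a MemoryStream; PositionDemo and WriteWithOptionalPrefix are names I made up for the example.

```csharp
using System;
using System.IO;
using System.Text;

static class PositionDemo
{
    // Hypothetical helper: writes "a" through a StreamWriter that uses
    // Encoding.UTF8, optionally after writing one raw byte to the stream
    // first, and returns everything that ended up in the stream.
    public static byte[] WriteWithOptionalPrefix(bool writePrefixByte)
    {
        using (var stream = new MemoryStream())
        {
            if (writePrefixByte)
            {
                stream.WriteByte(32); // moves the stream position to 1
            }

            using (var writer = new StreamWriter(stream, Encoding.UTF8))
            {
                writer.Write('a');
            }

            // ToArray still works after the writer has closed the stream.
            return stream.ToArray();
        }
    }
}
```

PositionDemo.WriteWithOptionalPrefix(false) returns four bytes (EF BB BF 61), while PositionDemo.WriteWithOptionalPrefix(true) returns only two (20 61), with no BOM.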

Volga answered 10/3, 2011 at 21:45 Comment(2)
I think the "position 0" thing is basically the crucial piece of information regarding this issue.Tiresome
Also, this constructor won't output a BOM either: new StreamWriter("file.txt", Encoding.UTF8)Tiresome
M
6

Do you use the same constructor of the StreamWriter for every file? Because the documentation says:

To create a StreamWriter using UTF-8 encoding and a BOM, consider using a constructor that specifies encoding, such as StreamWriter(String, Boolean, Encoding).

I was in a similar situation a while ago. I ended up using the Stream.Write method instead of the StreamWriter and wrote the result of Encoding.GetPreamble() before writing the Encoding.GetBytes(stringToWrite)

Marola answered 10/3, 2011 at 21:40 Comment(0)
C
5

I found this answer useful (thanks to @Philipp Grathwohl and @Nik), but in my case I'm using FileStream to accomplish the task, so, the code that generates the BOM goes like this:

using (FileStream vStream = File.Create(pfilePath))
{
    // Creates the UTF-8 encoding with parameter "encoderShouldEmitUTF8Identifier" set to true
    Encoding vUTF8Encoding = new UTF8Encoding(true);
    // Gets the preamble in order to attach the BOM
    var vPreambleByte = vUTF8Encoding.GetPreamble();

    // Writes the preamble first
    vStream.Write(vPreambleByte, 0, vPreambleByte.Length);

    // Gets the bytes from text
    byte[] vByteData = vUTF8Encoding.GetBytes(pTextToSaveToFile);
    vStream.Write(vByteData, 0, vByteData.Length);
    vStream.Close();
}
Circumvent answered 2/12, 2014 at 14:28 Comment(1)
I mostly found the new UTF8Encoding(true) constructor useful to know.Blas
V
3

It seems that if the file already existed and didn't contain a BOM, then it won't contain a BOM when overwritten. In other words, StreamWriter preserves the BOM (or its absence) when overwriting a file.

Vichy answered 23/6, 2011 at 13:59 Comment(0)
W
1

Could you please show a situation where it doesn't produce it? The only case I can find where the preamble isn't present is when nothing is ever written to the writer. (Jim Mischel seems to have found another one, which is logical and more likely to be your problem; see that answer.)

My test code:

var stream = new MemoryStream();
using(var writer = new StreamWriter(stream, System.Text.Encoding.UTF8))
{
    writer.Write('a');
}
Console.WriteLine(stream.ToArray()
    .Select(b => b.ToString("X2"))
    .Aggregate((i, a) => i + " " + a)
    );
Whipping answered 10/3, 2011 at 21:45 Comment(0)
B
1

After reading the source code of StreamWriter: you need to make sure you are creating a new file; only then will the byte order mark be added to the file.
https://github.com/dotnet/runtime/blob/6ef4b2e7aba70c514d85c2b43eac1616216bea55/src/libraries/System.Private.CoreLib/src/System/IO/StreamWriter.cs#L267
Code in Flush method

if (!_haveWrittenPreamble)
{
    _haveWrittenPreamble = true;
    ReadOnlySpan<byte> preamble = _encoding.Preamble;
    if (preamble.Length > 0)
    {
        _stream.Write(preamble);
    }
}

https://github.com/dotnet/runtime/blob/6ef4b2e7aba70c514d85c2b43eac1616216bea55/src/libraries/System.Private.CoreLib/src/System/IO/StreamWriter.cs#L129
Code that sets the value of _haveWrittenPreamble

// If we're appending to a Stream that already has data, don't write
// the preamble.
if (_stream.CanSeek && _stream.Position > 0)
{
    _haveWrittenPreamble = true;
}

Brower answered 13/4, 2021 at 5:32 Comment(0)
D
0

Using Encoding.Default instead of Encoding.UTF8 solved my problem.

Doretheadoretta answered 10/11, 2021 at 11:4 Comment(0)
S
-1

When FileStream is not used and encoding is not specified, file is written in ANSI unless there's a non-english character then it's converted to UTF-8 without BOM.

StreamWriter writer = new StreamWriter("C:\\file.txt");

Adding UTF-8 encoding will create and write the file with a BOM. An existing file without a BOM will have a BOM added when overwritten. Passing false for the append parameter means the file is overwritten rather than appended to.

StreamWriter writer = new StreamWriter("C:\\file.txt", false, Encoding.UTF8);
Slug answered 9/3, 2023 at 22:35 Comment(4)
Wrong. The docs explicitly say that the default is UTF8 without BOM. ANSI is an ambiguous term anyway. Is it the 7-bit US-ASCII codepage? Latin 1? Or the user's codepage choice for non-Unicode programs, what's erroneously called "system codepage" ?Lashonlashond
Learn well before saying wrong. I'm the author of a Windows text editor PimNote pimtel.com/pimnote so I know and tested things well before saying. I also wrote my own method to detect a UTF-8 file without BOM. You should know that a file has no encoding. The bytes of each character in the file must be interpreted to determine the encoding. Text editors fall back to 8-bit ANSI when a Unicode character or BOM is not found. Most real-world non-English characters use at least 2 bytes, not just bits or Latin. Saying UTF-8 without BOM is confusing to people who see files in text editors as ANSI.Slug
Learn well before saying wrong., no better way to learn than the actual source, which shows you're wrong. StreamWriter always uses a stream and always uses UTF8 without BOM unless explicitly specified. Your comment is just word gymnastics trying to prove a wrong assertion. I've been using Unicode for over 30 years (no Greek characters in Latin 1) and actually remember what working without it means, having to fix mangled text or handle files stored with an encoding that didn't match the system's.Lashonlashond
As for encoding detection, it's easy to check for non-UTF8. Just check for invalid bytes. Detecting the actual encoding is hard. Browsers, starting with Netscape use statistics to find probable encodings. Back in the 2000s, when many web sites didn't even include a charset tag, that was a real problem. It still is in many government sites and unfortunately, lots of localized data files used for analytics. Libraries like chardet in Python are used to detect the actual encodingLashonlashond
