How to GetBytes() in C# with UTF8 encoding with BOM?

I'm having a problem with UTF8 encoding in my ASP.NET MVC 2 application in C#. I'm trying to let the user download a simple text file built from a string. I am trying to get a byte array with the following line:

var x = Encoding.UTF8.GetBytes(csvString);

but when I return it for download using:

return File(x, ..., ...);

I get a file without a BOM, so Croatian characters don't show up correctly. This is because my byte array does not include the BOM after encoding. I tried inserting those bytes manually, and then it displays correctly, but that's not the best way to do it.

I also tried creating a UTF8Encoding class instance and passing a boolean value (true) to its constructor to include the BOM, but it doesn't work either.

Does anyone have a solution? Thanks!

Intermarriage answered 10/12, 2010 at 23:5 Comment(0)

Try like this:

public ActionResult Download()
{
    var data = Encoding.UTF8.GetBytes("some data");
    var result = Encoding.UTF8.GetPreamble().Concat(data).ToArray();
    return File(result, "application/csv", "foo.csv");
}

The reason is that the UTF8Encoding constructor that takes a boolean parameter doesn't do what you would expect:

byte[] bytes = new UTF8Encoding(true).GetBytes("a");

The resulting array would contain a single byte with the value 97. There's no BOM because GetBytes never writes one; UTF-8 doesn't require a BOM, so the encoding only exposes it through GetPreamble().
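A minimal sketch illustrating the point (the byte values below are the standard UTF-8 BOM, EF BB BF):

```csharp
using System;
using System.Text;

class BomDemo
{
    static void Main()
    {
        // GetBytes never emits a BOM, regardless of the constructor flag.
        byte[] bytes = new UTF8Encoding(true).GetBytes("a");
        Console.WriteLine(bytes.Length); // 1
        Console.WriteLine(bytes[0]);     // 97

        // The BOM is only available via GetPreamble().
        byte[] bom = new UTF8Encoding(true).GetPreamble();
        Console.WriteLine(string.Join(",", bom)); // 239,187,191

        // With the flag set to false, GetPreamble() returns an empty array.
        Console.WriteLine(new UTF8Encoding(false).GetPreamble().Length); // 0
    }
}
```

So the flag passed to the constructor only controls what GetPreamble() returns, which is why concatenating the preamble manually (as above) is needed.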

Centurial answered 10/12, 2010 at 23:11 Comment(7)
Thanks! I was going crazy with my special characters not working in Excel CSV :)Volumed
For clarity, Encoding.UTF8 is equivalent to new UTF8Encoding(true). The parameter controls whether GetPreamble() will emit a BOM.Mizell
There's no BOM because GetBytes can't assume we're writing to a file. Whoever writes to the file should do the preamble thing first (like a StreamWriter, for example).Calcine
Why content type is set to "application/csv" instead of "text/csv" (as shown here)? In any case, neither way works, here. Excel still opens it with unrecognizable characters.Gerek
The MIME type should be: text/csv, see here (and if you want to be more precise then use: text/csv; charset=utf-8, see here).Proconsul
If I use contentType of application/csv it works fine, but if I replace it with text/csv it stops working, maybe someone has a clue why is that?Nazareth
I was having this issue as well, and this is the only solution that worked for me. There are other suggestions about telling the user to change the encoding, but that doesn't work when you have thousands of users complaining of random encoding issues; customers never read the instructions, so it's better to provide the file in a format that works as expected.Mccreery

I created a simple extension to convert any string, in any encoding, to the byte array it would produce when written to a file or stream:

public static class StreamExtensions
{
    public static byte[] ToBytes(this string value, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        using (var sw = new StreamWriter(stream, encoding))
        {
            sw.Write(value);
            sw.Flush();
            return stream.ToArray();
        }
    }
}

Usage:

stringValue.ToBytes(Encoding.UTF8)

This also works for other encodings, like UTF-16, which require a BOM.
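A quick check of this approach (a sketch; it simply repeats the extension from this answer so the snippet compiles on its own). StreamWriter writes the encoding's preamble before the first write to an empty stream, which is where the BOM comes from:

```csharp
using System;
using System.IO;
using System.Text;

static class StreamExtensions
{
    // Same extension as in the answer above.
    public static byte[] ToBytes(this string value, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        using (var sw = new StreamWriter(stream, encoding))
        {
            sw.Write(value);
            sw.Flush();
            return stream.ToArray();
        }
    }
}

class Demo
{
    static void Main()
    {
        // UTF-8 result starts with the BOM EF BB BF, then the text bytes.
        Console.WriteLine(BitConverter.ToString("abc".ToBytes(Encoding.UTF8)));
        // UTF-16LE result starts with its BOM, FF FE.
        Console.WriteLine(BitConverter.ToString("abc".ToBytes(Encoding.Unicode), 0, 2));
    }
}
```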

Vivle answered 15/6, 2015 at 7:28 Comment(2)
This is actually a very useful workaround. The use of a StreamWriter, with encoding, solved my immediate problem and allowed my file to be opened with Excel 2013.Chaworth
Thanks. It helped me save a .csv with Arabic characters. Using Encoding.GetBytes returned a bad file with unknown characters.Vancouver

UTF-8 does not require a BOM, because it is a sequence of 1-byte words. UTF-8 = UTF-8BE = UTF-8LE.

In contrast, UTF-16 requires a BOM at the beginning of the stream to identify whether the remainder of the stream is UTF-16BE or UTF-16LE, because UTF-16 is a sequence of 2-byte words and the BOM identifies whether the bytes in the words are BE or LE.
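The contrast is easy to see from the preambles .NET reports for each encoding (a small sketch; `Encoding.Unicode` is UTF-16LE and `Encoding.BigEndianUnicode` is UTF-16BE):

```csharp
using System;
using System.Text;

class Utf16BomDemo
{
    static void Main()
    {
        // UTF-16LE and UTF-16BE have different BOMs, which is what lets
        // a reader tell the two byte orders apart.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));          // FF-FE
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF

        // UTF-8 has only one byte order, so its BOM (EF BB BF) is optional
        // and serves only as an encoding signature.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble()));             // EF-BB-BF
    }
}
```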

The problem does not lie with the Encoding.UTF8 class. The problem lies with whatever program you are using to view the files.

Bibliomania answered 10/12, 2010 at 23:11 Comment(6)
UTF-8 is a variable width encoding. It only requires 1 byte to encode ASCII characters, but other code points will use multiple bytes.Baculiform
The codepoints encoded with multiple bytes have a pre-defined order (based on the U+ big-endian representation). However, since UTF8 is represented as a stream of bytes (rather than as a stream of words or dwords which are themselves represented as a sequence of bytes), the concept of endianness doesn't apply. Endianness is applicable to the representation of 16-, 32-, 64-, 128-bit integers as bytes, not to the representation of codepoints as bytes.Bibliomania
Sorry, I thought you were referring to the storage of codepoints with the phrase "sequence of 1 byte words". Thanks for the clarification. +1 for your answer and comment.Baculiform
Some programs use it to detect the encoding as being UTF-8. Programs that don't require it should ignore it as the character emitted is something that is to be ignored anyway. It's older programs that can't handle the BOM.Calcine
It does, if you wanna, say, open a UTF-8 file that has surrogate pairs in Visual Studio...Dogmatic
@Bibliomania Sorry, but although I agree with you regarding the lack of encoding recognition in some programs, when the faulty program is something as widespread as Excel 2016 opening CSV files, answers like the ones of Hovhannes Hakobyan or Darin Dimitrov are much more helpful than yours.Quash

Remember that .NET strings are all Unicode while they stay in memory, so if you can see your csvString correctly in the debugger, the problem is in writing the file.

In my opinion, you should return a FileResult with the same encoding as the file. Try setting the encoding on the returned File.

Leet answered 10/12, 2010 at 23:12 Comment(0)