Remove Byte Order Mark from a File.ReadAllBytes (byte[])

I have an HTTPHandler that reads in a set of CSS files, combines them, and then GZips the result. However, some of the CSS files contain a Byte Order Mark (due to a bug in TFS 2005 auto merge), and in Firefox the BOM is being read as part of the actual content, so it's screwing up my class names etc. How can I strip out the BOM characters? Is there an easy way to do this without manually going through the byte array looking for "ï»¿"?

Elijah answered 13/11, 2008 at 20:12 Comment(3)
Is the BOM appearing in the actual text itself, or just at the very start? I'd be surprised to see it anywhere other than at the start of the data, in which case simply ignoring the first 3 bytes (assuming UTF-8) should do the trick. - Urochrome
FWIW, you could open the files in Notepad++ and save them without the Byte Order Mark. It's what I had to do in this question. - Tuneful
I wrote the following post after coming across this issue. Essentially, instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve. - Salerno
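
The StreamReader idea from the last comment can be sketched for the original scenario (combining several CSS files into one response body). This is a minimal sketch under stated assumptions, not the asker's handler: the class name CssCombiner and the method Combine are hypothetical, the files are assumed to be UTF-8 when they carry no BOM, and GZipping is left out.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper; the names are illustrative and not from the thread.
public static class CssCombiner
{
    // Reads each file as text, so a leading BOM is consumed during decoding
    // and never becomes part of the content, then joins the results.
    public static byte[] Combine(IEnumerable<string> paths)
    {
        var combined = new StringBuilder();
        foreach (var path in paths)
        {
            // The encoding argument is only the fallback for files without a BOM.
            combined.AppendLine(File.ReadAllText(path, Encoding.UTF8));
        }

        // GetBytes encodes the characters only; it does not prepend a preamble.
        return new UTF8Encoding(false).GetBytes(combined.ToString());
    }
}
```

Because the BOM is dropped per file while decoding, a BOM in the second or third file cannot end up in the middle of the combined output, which is where a byte-level concatenation would leave it.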

Expanding on Jon's comment with a sample.

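// Requires "using System.Linq;" for Skip and ToArray.
var name = GetFileName();
var bytes = System.IO.File.ReadAllBytes(name);
// Drops the first 3 bytes unconditionally, i.e. assumes a UTF-8 BOM is present.
System.IO.File.WriteAllBytes(name, bytes.Skip(3).ToArray());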
Bicknell answered 14/11, 2008 at 2:54 Comment(2)
Quoting the OP: "some of the CSS files contain a Byte Order Mark". Only some of them, yet the code above doesn't check whether there is a BOM before skipping it. - Eakins
But UTF-32 has a 4-byte BOM, so in that case you have to skip 4 bytes. - Pycnometer
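
Both comments can be addressed by checking for a BOM before skipping anything and by skipping as many bytes as the detected BOM actually has. The following is a minimal sketch, not part of the original answer: the class BomStripper and the method StripBom are hypothetical names, and the signature table covers only the UTF-8, UTF-16 and UTF-32 byte order marks.

```csharp
using System;
using System.Linq;

// Hypothetical helper; the names are illustrative and not from the thread.
public static class BomStripper
{
    // Longer signatures come first so that UTF-32 LE (FF FE 00 00)
    // is not mistaken for UTF-16 LE (FF FE).
    private static readonly byte[][] Boms =
    {
        new byte[] { 0x00, 0x00, 0xFE, 0xFF }, // UTF-32 big-endian
        new byte[] { 0xFF, 0xFE, 0x00, 0x00 }, // UTF-32 little-endian
        new byte[] { 0xEF, 0xBB, 0xBF },       // UTF-8
        new byte[] { 0xFE, 0xFF },             // UTF-16 big-endian
        new byte[] { 0xFF, 0xFE },             // UTF-16 little-endian
    };

    // Returns a copy without the leading BOM, or the input itself if no BOM is found.
    public static byte[] StripBom(byte[] bytes)
    {
        foreach (var bom in Boms)
        {
            if (bytes.Length >= bom.Length && bytes.Take(bom.Length).SequenceEqual(bom))
            {
                return bytes.Skip(bom.Length).ToArray();
            }
        }
        return bytes;
    }
}
```

One caveat of this ordering: a UTF-16 little-endian file whose first character is U+0000 starts with the same four bytes as a UTF-32 little-endian BOM. For CSS files that are expected to be UTF-8, that ambiguity should not arise.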

Expanding JaredPar's sample to recurse into sub-directories:

using System.Linq;
using System.IO;
namespace BomRemover
{
    /// <summary>
    /// Remove the UTF-8 BOM (EF BB BF) from all *.php files in the current directory and its sub-directories.
    /// </summary>
    class Program
    {
        private static void removeBoms(string filePattern, string directory)
        {
            foreach (string filename in Directory.GetFiles(directory, filePattern))
            {
                var bytes = System.IO.File.ReadAllBytes(filename);
                if(bytes.Length > 2 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
                {
                    System.IO.File.WriteAllBytes(filename, bytes.Skip(3).ToArray()); 
                }
            }
            foreach (string subDirectory in Directory.GetDirectories(directory))
            {
                removeBoms(filePattern, subDirectory);
            }
        }
        static void Main(string[] args)
        {
            string filePattern = "*.php";
            string startDirectory = Directory.GetCurrentDirectory();
            removeBoms(filePattern, startDirectory);            
        }       
    }
}

I needed this piece of C# code after discovering that a UTF-8 BOM corrupts the file when you do a basic PHP file download.

Autarch answered 19/5, 2010 at 8:23 Comment(0)
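// File.ReadAllText detects and drops a leading BOM while decoding.
var text = File.ReadAllText(args.SourceFileName);
// UTF8Encoding(false) makes the writer omit the BOM on output.
using (var streamWriter = new StreamWriter(args.DestFileName, args.Append, new UTF8Encoding(false)))
{
    streamWriter.Write(text);
}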
Karisakarissa answered 16/7, 2009 at 9:50 Comment(2)
Looking at this code, ideally it should work. But I am surprised that it is saving the file in ANSI format. - Earlearla
In new UTF8Encoding(false), the parameter indicates whether to add the BOM or not. - Counterblow
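
The effect of that constructor parameter can be observed through Encoding.GetPreamble, which returns the bytes a writer would place at the start of a file. The sketch below is only an illustration; PreambleDemo and PreambleLength are hypothetical names.

```csharp
using System.Text;

// Hypothetical helper; the names are illustrative and not from the thread.
public static class PreambleDemo
{
    // Number of BOM bytes a writer using this encoding emits at the start of a file.
    public static int PreambleLength(Encoding encoding)
    {
        return encoding.GetPreamble().Length;
    }
}
```

PreambleLength(new UTF8Encoding(false)) is 0, while PreambleLength(new UTF8Encoding(true)) and PreambleLength(Encoding.UTF8) are both 3. This probably also explains the "ANSI" observation in the first comment: a UTF-8 file without a BOM that contains only ASCII characters is byte-for-byte identical to an ANSI file, so an editor has nothing to tell the two apart.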

Another way, assuming the file is UTF-8 and contains only ASCII characters (anything outside ASCII is written as '?'):

File.WriteAllText(filename, File.ReadAllText(filename, Encoding.UTF8), Encoding.ASCII);
Biron answered 14/11, 2008 at 8:32 Comment(0)

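For larger files, use the following code, which processes the file line by line and is memory efficient. Note that it writes the output as UTF-16 little-endian without a BOM: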

StreamReader sr = new StreamReader(path: @"<Input_file_full_path_with_byte_order_mark>", 
                    detectEncodingFromByteOrderMarks: true);

StreamWriter sw = new StreamWriter(path: @"<Output_file_without_byte_order_mark>", 
                    append: false, 
                    encoding: new UnicodeEncoding(bigEndian: false, byteOrderMark: false));

var lineNumber = 0;
while (!sr.EndOfStream)
{
    sw.WriteLine(sr.ReadLine());
    lineNumber += 1;
    if (lineNumber % 100000 == 0)
        Console.Write("\rLine# " + lineNumber.ToString("000000000000"));
}

sw.Flush();
sw.Close();
sr.Close(); // also release the input file
Persistent answered 14/3, 2018 at 13:37 Comment(0)
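
If the goal is only to drop a UTF-8 BOM from a large file while keeping the rest of the bytes unchanged, the copy can also be done at the byte level without decoding any text. This is a sketch of an alternative technique, not the answer's method: StreamingBomRemover and CopyWithoutUtf8Bom are hypothetical names, only the UTF-8 BOM is handled, and the source and destination are assumed to be different files.

```csharp
using System.IO;

// Hypothetical helper; the names are illustrative and not from the thread.
public static class StreamingBomRemover
{
    // Copies source to destination, dropping a leading UTF-8 BOM (EF BB BF) if present.
    public static void CopyWithoutUtf8Bom(string sourcePath, string destinationPath)
    {
        using (var input = File.OpenRead(sourcePath))
        using (var output = File.Create(destinationPath))
        {
            var head = new byte[3];
            int read = 0;

            // Read may return fewer bytes than requested, so loop until
            // 3 bytes have arrived or the end of the file is reached.
            while (read < head.Length)
            {
                int n = input.Read(head, read, head.Length - read);
                if (n == 0) break;
                read += n;
            }

            bool hasBom = read == 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
            if (!hasBom)
            {
                // Not a BOM: these bytes are real content and must be kept.
                output.Write(head, 0, read);
            }

            // Stream the remainder in buffered chunks.
            input.CopyTo(output);
        }
    }
}
```

Unlike the line-by-line version, this keeps the original encoding and line endings, since nothing after the first three bytes is interpreted.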
