I have an HTTPHandler that reads in a set of CSS files, combines them, and then GZips them. However, some of the CSS files contain a Byte Order Mark (due to a bug in TFS 2005 auto merge), and in Firefox the BOM is being read as part of the actual content, so it's screwing up my class names etc. How can I strip out the BOM characters? Is there an easy way to do this without manually going through the byte array looking for "ï»¿"?
Remove Byte Order Mark from a File.ReadAllBytes (byte[])
Is the BOM appearing in the actual text itself, or just at the very start? I'd be surprised to see it anywhere other than at the start of the data - in which case simply ignoring the first 3 bytes (assuming UTF-8) should do the trick. –
Urochrome
FWIW, you could open the files in Notepad++ and save them without the Byte Order Mark. It's what I had to do in this question. –
Tuneful
I wrote the following post after coming across this issue. Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve. –
Salerno
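As a rough illustration of the StreamReader approach described in the comment above, the sketch below decodes a byte array through a StreamReader so that a leading BOM, if present, is consumed during decoding rather than returned as text. The names BomFreeReader and ReadText are made up for this example, and it assumes the files are UTF-8 whenever no BOM is present.

```csharp
using System.IO;
using System.Text;

public static class BomFreeReader
{
    // Decodes raw file bytes to text. With detectEncodingFromByteOrderMarks
    // set to true, StreamReader consumes a leading BOM instead of returning it.
    // Assumption: input without a BOM is UTF-8.
    public static string ReadText(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

In a handler like the one in the question, the strings returned for each CSS file could presumably be concatenated and then encoded once with new UTF8Encoding(false) before compression, so that no BOM ends up in the middle of the combined output.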
Expanding on Jon's comment with a sample.
var name = GetFileName();
var bytes = System.IO.File.ReadAllBytes(name);
System.IO.File.WriteAllBytes(name, bytes.Skip(3).ToArray());
Quoting the OP: "However, some of the CSS files contain a Byte Order Mark." Note the "some": the code above doesn't check whether there's a BOM before it skips it. –
Eakins
But UTF-32 has a 4-byte BOM. In that case you have to skip 4 bytes. –
Pycnometer
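Picking up the two comments above (check before skipping, and allow for BOMs that are not 3 bytes long), here is a minimal sketch that only strips bytes when a known BOM signature is actually present. BomStripper and StripBom are hypothetical names for this example. One caveat: FF FE 00 00 is treated as a UTF-32 LE BOM, although the same bytes could also be a UTF-16 LE BOM followed by a NUL character.

```csharp
using System;

public static class BomStripper
{
    // Known BOM signatures. Longer ones come first so that UTF-32 LE
    // (FF FE 00 00) is not mistaken for UTF-16 LE (FF FE).
    private static readonly byte[][] Signatures =
    {
        new byte[] { 0x00, 0x00, 0xFE, 0xFF }, // UTF-32 BE
        new byte[] { 0xFF, 0xFE, 0x00, 0x00 }, // UTF-32 LE
        new byte[] { 0xEF, 0xBB, 0xBF },       // UTF-8
        new byte[] { 0xFE, 0xFF },             // UTF-16 BE
        new byte[] { 0xFF, 0xFE },             // UTF-16 LE
    };

    // Returns a copy of the data without its leading BOM,
    // or the original array unchanged when no BOM is found.
    public static byte[] StripBom(byte[] bytes)
    {
        foreach (var signature in Signatures)
        {
            if (StartsWith(bytes, signature))
            {
                var result = new byte[bytes.Length - signature.Length];
                Array.Copy(bytes, signature.Length, result, 0, result.Length);
                return result;
            }
        }
        return bytes;
    }

    private static bool StartsWith(byte[] bytes, byte[] prefix)
    {
        if (bytes.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (bytes[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

With this in place, the sample above could call BomStripper.StripBom(bytes) instead of bytes.Skip(3).ToArray(), which leaves files without a BOM untouched.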
Expanding on JaredPar's sample to recurse over subdirectories:
using System.Linq;
using System.IO;

namespace BomRemover
{
    /// <summary>
    /// Remove UTF-8 BOM (EF BB BF) of all *.php files in current and sub-directories.
    /// </summary>
    class Program
    {
        private static void removeBoms(string filePattern, string directory)
        {
            foreach (string filename in Directory.GetFiles(directory, filePattern))
            {
                var bytes = System.IO.File.ReadAllBytes(filename);
                // Only rewrite the file if it really starts with the UTF-8 BOM.
                if (bytes.Length > 2 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
                {
                    System.IO.File.WriteAllBytes(filename, bytes.Skip(3).ToArray());
                }
            }
            // Recurse into each sub-directory.
            foreach (string subDirectory in Directory.GetDirectories(directory))
            {
                removeBoms(filePattern, subDirectory);
            }
        }

        static void Main(string[] args)
        {
            string filePattern = "*.php";
            string startDirectory = Directory.GetCurrentDirectory();
            removeBoms(filePattern, startDirectory);
        }
    }
}
I needed this piece of C# code after discovering that a UTF-8 BOM corrupts the file when you try to do a basic PHP file download.
var text = File.ReadAllText(args.SourceFileName);
var streamWriter = new StreamWriter(args.DestFileName, args.Append, new UTF8Encoding(false));
streamWriter.Write(text);
streamWriter.Close();
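To make the role of the constructor argument in the snippet above more concrete, here is a small sketch that reports how many BOM bytes a given encoding would emit at the start of a file. PreambleDemo and BomLength are hypothetical names for this example; GetPreamble is the standard Encoding method that returns the BOM bytes, if any.

```csharp
using System.Text;

public static class PreambleDemo
{
    // Number of BOM bytes a writer using this encoding would emit
    // at the start of a new file (0 means no BOM is written).
    public static int BomLength(Encoding encoding)
    {
        return encoding.GetPreamble().Length;
    }
}
```

BomLength(new UTF8Encoding(false)) gives 0, while BomLength(new UTF8Encoding(true)) and BomLength(Encoding.UTF8) both give 3, which is why the snippet passes an explicit UTF8Encoding(false) rather than Encoding.UTF8.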
Looking at this code, ideally it should work. But I am surprised that it is saving the file in ANSI format. –
Earlearla
In new UTF8Encoding(false), the parameter indicates whether to add the BOM or not. –
Counterblow
Another way, assuming the content converts cleanly from UTF-8 to ASCII (any non-ASCII characters will be written as '?'):
File.WriteAllText(filename, File.ReadAllText(filename, Encoding.UTF8), Encoding.ASCII);
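For larger files, use the following code, which streams the file line by line and is therefore memory efficient. Note that it writes the output as UTF-16 LE without a BOM (UnicodeEncoding); pass new UTF8Encoding(false) instead if the output should be UTF-8.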
StreamReader sr = new StreamReader(path: @"<Input_file_full_path_with_byte_order_mark>",
detectEncodingFromByteOrderMarks: true);
StreamWriter sw = new StreamWriter(path: @"<Output_file_without_byte_order_mark>",
append: false,
encoding: new UnicodeEncoding(bigEndian: false, byteOrderMark: false));
var lineNumber = 0;
while (!sr.EndOfStream)
{
sw.WriteLine(sr.ReadLine());
lineNumber += 1;
if (lineNumber % 100000 == 0)
Console.Write("\rLine# " + lineNumber.ToString("000000000000"));
}
sw.Flush();
sw.Close();
sr.Close();