How to remove BOM from byte array
U

5

10

I have XML data in byte[] byteArray which may or may not contain a BOM. Is there a standard way in C# to remove the BOM from it? If not, what is the best way to do so that handles all cases, including all types of encoding?

Actually, I am fixing a bug in existing code and I don't want to change much of it, so it would be best if someone could give me code to remove the BOM.

I know that I could look for byte 60, the ASCII value of '<', and ignore the bytes before it, but I don't want to do that.

Unlatch answered 18/3, 2013 at 11:49 Comment(2)
Can the data be either UTF-8 (with or without byte-order-mark) or UTF-16 (with or without BOM; little-endian or big-endian)?Joslin
I have edited your title. Please see, "Should questions include “tags” in their titles?", where the consensus is "no, they should not".Sensorimotor
S
10

All of the C# XML parsers will automatically handle the BOM for you. I'd recommend using XDocument - in my opinion it provides the cleanest abstraction of XML data.

Using XDocument as an example:

using (var stream = new MemoryStream(bytes))
{
  var document = XDocument.Load(stream);
  ...
}

Once you have an XDocument you can then use it to emit the bytes without the BOM:

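var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) }; // false = no BOM
byte[] bytesWithoutBOM;
using (var stream = new MemoryStream())
{
  // The settings must be passed to Create; writer.Settings is read-only afterwards.
  using (var writer = XmlWriter.Create(stream, settings))
  {
    document.WriteTo(writer);
  } // disposing the writer flushes it into the stream
  bytesWithoutBOM = stream.ToArray();
}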
Seltzer answered 18/3, 2013 at 11:53 Comment(6)
Actually, I only want to remove the BOM and don't need to care about parsing. I have updated the question as well.Unlatch
@RaviGupta I see, do you know the encoding?Faria
It would be better if the logic were independent of the encoding.Unlatch
@RaviGupta Answer updated. There may be a more efficient way, perhaps looking at the internals of XmlReader to see how they detect the BOM, however what I have written above should work fine.Faria
Can we do it for all encodings? For example, instead of writer.Settings.Encoding = new UTF8Encoding(false); can we do writer.Settings.Encoding = new Encoding ... or something like that?Unlatch
@RaviGupta The above code will 'normalise' the encoding to UTF8. An encoding must be specified when writing out the bytes; you can choose a different one, UTF8 was chosen arbitrarily.Faria
A
3

You don't have to worry about BOM.

If for some reason you need to use an XmlDocument object, maybe this code can help you:

byte[] file_content = {wherever you get it};
XmlDocument xml = new XmlDocument();
xml.Load(new MemoryStream(file_content));

It worked for me when I tried to download an XML attachment from a Gmail account using the Google API. The file had a BOM, and using Encoding.UTF8.GetString(file_content) didn't work "properly".
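
A minimal sketch of the difference described above, under the assumption that the input is a small UTF-8 document with a BOM. The class and method names (BomDecodeDemo, DecodeKeepingBom, RootNameViaXmlDocument) are made up for illustration. Encoding.UTF8.GetString decodes the mark as the character U+FEFF and leaves it at the start of the string, whereas XmlDocument.Load on a stream lets the XML reader consume it:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class BomDecodeDemo
{
    // Encoding.UTF8.GetString does not strip a leading BOM:
    // the mark survives as the character U+FEFF at index 0.
    public static string DecodeKeepingBom(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }

    // Loading the same bytes through XmlDocument lets the XML reader
    // detect and skip the BOM, so the document parses normally.
    public static string RootNameViaXmlDocument(byte[] data)
    {
        var xml = new XmlDocument();
        using (var stream = new MemoryStream(data))
        {
            xml.Load(stream);
        }
        return xml.DocumentElement.Name;
    }
}
```

A string that still starts with U+FEFF is the likely reason the GetString route misbehaves when the result is handed to an XML parser as text.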

Alagez answered 17/2, 2019 at 1:52 Comment(0)
K
2

You could do something like this to skip the BOM bytes while reading from a stream. You would need to extend Bom.cs to include further encodings; however, as far as I know the Unicode (UTF) encodings are the only ones that use a BOM... I could (most likely) be wrong about that though.

I got the info on the encoding types from here

using (var stream = File.OpenRead("path_to_file"))
{
    stream.Position = Bom.GetCursor(stream);
}


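// Requires: using System.IO; using System.Linq;
public static class Bom
{
    // Returns the number of BOM bytes at the start of a seekable stream (0 if none).
    public static int GetCursor(Stream stream)
    {
        // The UTF-32 marks are checked first because the UTF-32 LE mark
        // starts with the same two bytes as the UTF-16 LE mark.
        // UTF-32, big-endian
        if (IsMatch(stream, new byte[] { 0x00, 0x00, 0xFE, 0xFF }))
            return 4;
        // UTF-32, little-endian
        if (IsMatch(stream, new byte[] { 0xFF, 0xFE, 0x00, 0x00 }))
            return 4;
        // UTF-16, big-endian
        if (IsMatch(stream, new byte[] { 0xFE, 0xFF }))
            return 2;
        // UTF-16, little-endian
        if (IsMatch(stream, new byte[] { 0xFF, 0xFE }))
            return 2;
        // UTF-8
        if (IsMatch(stream, new byte[] { 0xEF, 0xBB, 0xBF }))
            return 3;
        return 0;
    }

    private static bool IsMatch(Stream stream, byte[] match)
    {
        stream.Position = 0;
        var buffer = new byte[match.Length];
        int read = stream.Read(buffer, 0, buffer.Length);
        // A stream shorter than the mark cannot match. Without this check the
        // zero-filled tail of the buffer could yield a false UTF-32 LE match.
        return read == match.Length && buffer.SequenceEqual(match);
    }
}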
Kaete answered 1/5, 2013 at 9:43 Comment(0)
H
2

What you can also do is use a StreamReader.

Assuming you have a MemoryStream ms

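    // StreamReader consumes a leading BOM while decoding (BOM detection is on by default).
    using (StreamReader sr = new StreamReader(new MemoryStream(ms.ToArray()), Encoding.UTF8))
    {
         // UTF8Encoding(false) never emits a BOM, so these bytes are BOM-free UTF-8.
         var bytesWithoutBOM = new UTF8Encoding(false).GetBytes(sr.ReadToEnd());
         // Optional: Base64 text form of the BOM-free bytes.
         var stringWithoutBOM = Convert.ToBase64String(bytesWithoutBOM);
    }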
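
The snippet above always re-encodes to UTF-8. As a hedged variation, assuming the data is valid text in one of the encodings StreamReader can detect from a BOM, the reader's CurrentEncoding can be used to write the bytes back in the detected encoding. Encoding.GetBytes does not emit a preamble, so the result carries no BOM. The names StreamReaderBomStripper and StripBomKeepingEncoding are made up for this sketch, and input without a BOM is assumed to be UTF-8:

```csharp
using System.IO;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class StreamReaderBomStripper
{
    public static byte[] StripBomKeepingEncoding(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        // Third argument: detect the encoding from a byte order mark.
        // UTF8Encoding(false) is only the fallback when no BOM is present.
        using (var reader = new StreamReader(stream, new UTF8Encoding(false), true))
        {
            // The decoded text does not contain the BOM character.
            string text = reader.ReadToEnd();
            // CurrentEncoding is reliable only after the first read;
            // GetBytes never writes a preamble.
            return reader.CurrentEncoding.GetBytes(text);
        }
    }
}
```

Because this decodes and re-encodes the whole payload, it is heavier than slicing the array, and malformed byte sequences would not round-trip unchanged.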
Hindrance answered 16/6, 2021 at 12:55 Comment(0)
E
0

You'll have to identify the byte order marks at the beginning of the byte array. There are several different combinations, as described at http://www.unicode.org/faq/utf_bom.html#bom1.

Just create a little state machine that starts at the beginning of the byte array and looks for those sequences.

I don't know how your array is used or what other parameters you use with it, so I can't really say how you'd "remove" the sequence. Your options appear to be:

  1. If you have start and count parameters, you can just change those to reflect the starting point of the array (beyond the BOM).
  2. If you just have a count parameter (other than the array's Length property), you can move data in the array to overwrite the BOM, and adjust the count accordingly.
  3. If you don't have start or count parameters, then you'll want to create a new array that's the size of the old array minus the BOM, and copy the data into the new array.

In most cases the simplest way to "remove" the sequence is the third option: identify the mark if it's there and then copy the remaining bytes to a new byte array.
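
The third option can be sketched roughly as follows. This is an illustration under stated assumptions rather than a standard API: the class and method names (BomStripper, GetBomLength, StripBom) are made up, and only the UTF-8, UTF-16 and UTF-32 marks are recognized.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class BomStripper
{
    // Known Unicode byte order marks. The 4-byte UTF-32 marks come first because
    // the UTF-32 LE mark starts with the same two bytes as the UTF-16 LE mark.
    private static readonly byte[][] Marks =
    {
        new byte[] { 0x00, 0x00, 0xFE, 0xFF }, // UTF-32 BE
        new byte[] { 0xFF, 0xFE, 0x00, 0x00 }, // UTF-32 LE
        new byte[] { 0xEF, 0xBB, 0xBF },       // UTF-8
        new byte[] { 0xFE, 0xFF },             // UTF-16 BE
        new byte[] { 0xFF, 0xFE },             // UTF-16 LE
    };

    // Returns the length of the BOM at the start of data, or 0 if none is found.
    public static int GetBomLength(byte[] data)
    {
        foreach (var mark in Marks)
        {
            if (StartsWith(data, mark)) return mark.Length;
        }
        return 0;
    }

    // Returns a new array holding data without its leading BOM.
    public static byte[] StripBom(byte[] data)
    {
        int offset = GetBomLength(data);
        // Assumption: handing back the same instance is fine when there is no BOM.
        if (offset == 0) return data;
        var result = new byte[data.Length - offset];
        Buffer.BlockCopy(data, offset, result, 0, result.Length);
        return result;
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

For the first option, GetBomLength alone is enough: the returned value can serve as the start offset, for example via new ArraySegment<byte>(data, offset, data.Length - offset), without copying. One caveat: UTF-16 LE text whose first character after the BOM is U+0000 would be mistaken for UTF-32 LE, which should not occur in well-formed XML.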

Electrolyze answered 18/3, 2013 at 13:13 Comment(0)
