C#: Detect XML encoding from a byte array?
Well, I have a byte array, and I know it contains an XML-serialized object. Is there any way to get the encoding from it?

I'm not going to deserialize it, but I'm saving it in an XML field on a SQL Server... so I need to convert it to a string?

Melisent answered 24/2, 2009 at 10:55 Comment(1)
Any final solution with a full, working source code sample? - Demarcusdemaria
G
7

You could look at the first 40-ish bytes [1]. They should contain the XML declaration (assuming the document has one), which should either specify the encoding or let you assume UTF-8 or UTF-16; which of the two should be obvious from how you've decoded the <?xml part. (Just check for both patterns.)

Realistically, do you expect you'll ever get anything other than UTF-8 or UTF-16? If not, you could check for the patterns you get at the start of both of those and throw an exception if it doesn't follow either pattern. Alternatively, if you want to make another attempt, you could always try to decode the document as UTF-8, re-encode it and see if you get the same bytes back. It's not ideal, but it might just work.
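The decode/re-encode check from the paragraph above can be sketched roughly as follows. This is only a minimal sketch: the class and method names are made up for illustration, and it relies on Encoding.UTF8 replacing invalid byte sequences with U+FFFD when decoding, so malformed input does not round-trip.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf8RoundTrip
{
    // Hypothetical helper (not part of the answer): returns true when the
    // bytes survive a UTF-8 decode/re-encode cycle unchanged.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        // Invalid sequences decode to U+FFFD, which re-encodes as EF BF BD
        // and therefore no longer matches the original input.
        string text = Encoding.UTF8.GetString(bytes);
        byte[] roundTripped = Encoding.UTF8.GetBytes(text);
        return bytes.SequenceEqual(roundTripped);
    }
}
```

One caveat with this approach: a UTF-16 document without a BOM that contains only ASCII characters consists of valid UTF-8 bytes (every second byte is 00), so it passes the check. It is probably best combined with the pattern checks rather than used on its own.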

I'm sure there are more rigorous ways of doing this, but they're likely to be finicky :)


[1] Quite possibly fewer than this. I figure 20 characters should be enough, which is 40 bytes in UTF-16.

Granulation answered 24/2, 2009 at 11:5 Comment(3)
Downvoters: if you're going to downvote, please provide a comment. Otherwise the downvote serves no real purpose. - Granulation
Any final solution with a full, working source code sample? - Demarcusdemaria
@Kiquenet: Not from me, I'm afraid. I don't have time to come back to this right now. - Granulation
S
14

A solution similar to this question could solve this by using a Stream over the byte array. Then you won't have to fiddle at the byte level. Like this:

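// Requires: using System.IO; using System.Text; using System.Xml;
Encoding encoding;
using (var stream = new MemoryStream(bytes))
{
    using (var xmlreader = new XmlTextReader(stream))
    {
        // Reading up to the root element makes the reader process any BOM and XML declaration
        xmlreader.MoveToContent();
        encoding = xmlreader.Encoding;
    }
}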
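Since the question also asks about ending up with a string, the detected encoding can then be used to decode the same bytes. The following is a hedged sketch rather than a definitive implementation: the class and method names are invented for illustration, and setting XmlResolver to null is an added assumption so that the reader does not try to fetch external DTDs.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class XmlBytes
{
    // Hypothetical helper: detects the encoding with XmlTextReader and then
    // decodes the whole byte array with it.
    public static string ToText(byte[] bytes)
    {
        Encoding encoding;
        using (var stream = new MemoryStream(bytes))
        using (var reader = new XmlTextReader(stream))
        {
            reader.XmlResolver = null;   // assumption: never resolve external DTDs/entities
            reader.MoveToContent();      // throws XmlException if the prolog is malformed
            encoding = reader.Encoding;
        }

        // StreamReader drops a leading byte order mark when the third
        // argument (detectEncodingFromByteOrderMarks) is true.
        using (var stream = new MemoryStream(bytes))
        using (var text = new StreamReader(stream, encoding, true))
        {
            return text.ReadToEnd();
        }
    }
}
```

The returned string still contains the original XML declaration, including whatever encoding it names.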
Semifinal answered 12/3, 2009 at 10:56 Comment(0)
S
8

The W3C XML specification has a section on how to determine the encoding of a byte string.

First check for a Unicode Byte Order Mark

A BOM is just another character; it's the:

'ZERO WIDTH NO-BREAK SPACE' (U+FEFF)

For example:

  • ZWNBSP<?xml vers
  • "\ufeff<?xml vers"
  • "\ufeff\u003c\u003f\u0078\u006d\u006c\u0020\u0076\u0065\u0072\u0073"
  • U+FEFF U+003C U+003F U+0078 U+006D U+006C U+0020 U+0076 U+0065 U+0072 U+0073

The character U+FEFF, along with every other character in the file, is encoded using the appropriate encoding scheme:

  • 00 00 FE FF: UCS-4, big-endian machine (1234 order)
  • FF FE 00 00: UCS-4, little-endian machine (4321 order)
  • 00 00 FF FE: UCS-4, unusual octet order (2143)
  • FE FF 00 00: UCS-4, unusual octet order (3412)
  • FE FF ## ##: UTF-16, big-endian
  • FF FE ## ##: UTF-16, little-endian
  • EF BB BF: UTF-8

where ## ## can be anything - except for both being zero

  • U+FEFF U+003C U+003F U+0078 U+006D U+006C U+0020 U+0076 U+0065 U+0072 U+0073
  • fffe 3c00 3f00 7800 6d00 6c00 2000 7600 6500 7200 7300
  • ff fe 3c 00 3f 00 78 00 6d 00 6c 00 20 00 76 00 65 00 72 00 73 00
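To see where these signatures come from, .NET exposes the byte order mark that each built-in Unicode encoding writes through Encoding.GetPreamble(). A small sketch (the class and method names are only for illustration):

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Formats the byte order mark an encoding would emit, e.g. "EF-BB-BF".
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    // Prints the signature of each Unicode encoding that ships with .NET.
    public static void PrintAll()
    {
        Console.WriteLine(PreambleHex(Encoding.UTF8));                 // EF-BB-BF
        Console.WriteLine(PreambleHex(Encoding.Unicode));              // FF-FE
        Console.WriteLine(PreambleHex(Encoding.BigEndianUnicode));     // FE-FF
        Console.WriteLine(PreambleHex(Encoding.UTF32));                // FF-FE-00-00
        Console.WriteLine(PreambleHex(new UTF32Encoding(true, true))); // 00-00-FE-FF
    }
}
```

Here Encoding.Unicode is UTF-16 little-endian, Encoding.BigEndianUnicode is UTF-16 big-endian, Encoding.UTF32 is UTF-32 little-endian, and the explicitly constructed UTF32Encoding is the big-endian variant.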

So first check the initial bytes for any of those signatures. If you find one of them, return the corresponding code-page identifier:

UInt32 GuessEncoding(byte[] XmlString)
{
   //BytesEqual(data, b0, b1, ...) is assumed to return true when data starts with the given bytes
   if (BytesEqual(XmlString, 0x00, 0x00, 0xFE, 0xFF)) return 12001; //"utf-32BE" - Unicode UTF-32, big endian byte order
   if (BytesEqual(XmlString, 0xFF, 0xFE, 0x00, 0x00)) return 12000; //"utf-32" - Unicode UTF-32, little endian byte order
   if (BytesEqual(XmlString, 0x00, 0x00, 0xFF, 0xFE)) throw new Exception("Nobody supports 2143 UCS-4");
   if (BytesEqual(XmlString, 0xFE, 0xFF, 0x00, 0x00)) throw new Exception("Nobody supports 3412 UCS-4");

   //The UCS-4 checks above have already handled "FF FE 00 00" and "FE FF 00 00",
   //so any remaining FE FF or FF FE prefix is a UTF-16 byte order mark
   if (BytesEqual(XmlString, 0xFE, 0xFF)) return 1201;  //"unicodeFFFE" - Unicode UTF-16, big endian byte order
   if (BytesEqual(XmlString, 0xFF, 0xFE)) return 1200;  //"utf-16" - Unicode UTF-16, little endian byte order
   if (BytesEqual(XmlString, 0xEF, 0xBB, 0xBF)) return 65001; //"utf-8" - Unicode (UTF-8)

Or else look for <?xml

If the XML document has no Byte Order Mark character, move on to looking for the first five characters of the XML declaration, which a document that declares its own encoding must start with:

<?xml

It's helpful to know that

  • < is #x0000003C
  • ? is #x0000003F

With that we have enough to look at the first four bytes:

  • 00 00 00 3C: UCS-4, big-endian machine (1234 order)
  • 3C 00 00 00: UCS-4, little-endian machine (4321 order)
  • 00 00 3C 00: UCS-4, unusual octet order (2143)
  • 00 3C 00 00: UCS-4, unusual octet order (3412)
  • 00 3C 00 3F: UTF-16, big-endian
  • 3C 00 3F 00: UTF-16, little-endian
  • 3C 3F 78 6D: UTF-8 or another ASCII-compatible encoding (the encoding declaration tells you which)
  • 4C 6F A7 94: some flavor of EBCDIC

So we can then add more to our code:

   if (BytesEqual(XmlString, 0x00, 0x00, 0x00, 0x3C)) return 12001; //"utf-32BE" - Unicode UTF-32, big endian byte order
   if (BytesEqual(XmlString, 0x3C, 0x00, 0x00, 0x00)) return 12000; //"utf-32" - Unicode UTF-32, little endian byte order
   if (BytesEqual(XmlString, 0x00, 0x00, 0x3C, 0x00)) throw new Exception("Nobody supports 2143 UCS-4");
   if (BytesEqual(XmlString, 0x00, 0x3C, 0x00, 0x00)) throw new Exception("Nobody supports 3412 UCS-4");
   if (BytesEqual(XmlString, 0x00, 0x3C, 0x00, 0x3F)) return 1201;  //"unicodeFFFE" - Unicode UTF-16, big endian byte order
   if (BytesEqual(XmlString, 0x3C, 0x00, 0x3F, 0x00)) return 1200;  //"utf-16" - Unicode UTF-16, little endian byte order
   if (BytesEqual(XmlString, 0x3C, 0x3F, 0x78, 0x6D)) return 65001; //"utf-8" - Unicode (UTF-8)
   if (BytesEqual(XmlString, 0x4C, 0x6F, 0xA7, 0x94))
   {
      //Some variant of EBCDIC, e.g.:
      //20273   IBM273  IBM EBCDIC Germany
      //20277   IBM277  IBM EBCDIC Denmark-Norway
      //20278   IBM278  IBM EBCDIC Finland-Sweden
      //20280   IBM280  IBM EBCDIC Italy
      //20284   IBM284  IBM EBCDIC Latin America-Spain
      //20285   IBM285  IBM EBCDIC United Kingdom
      //20290   IBM290  IBM EBCDIC Japanese Katakana Extended
      //20297   IBM297  IBM EBCDIC France
      //20420   IBM420  IBM EBCDIC Arabic
      //20423   IBM423  IBM EBCDIC Greek
      //20424   IBM424  IBM EBCDIC Hebrew
      //20833   x-EBCDIC-KoreanExtended IBM EBCDIC Korean Extended
      //20838   IBM-Thai    IBM EBCDIC Thai
      //20871   IBM871  IBM EBCDIC Icelandic
      //20880   IBM880  IBM EBCDIC Cyrillic Russian
      //20905   IBM905  IBM EBCDIC Turkish
      //20924   IBM00924    IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
      throw new Exception("We don't support EBCDIC. Sorry");
   }

   //Otherwise assume UTF-8 (decoding may still fail later if that guess is wrong)
   return 65001; //"utf-8" - Unicode (UTF-8)

   //Any code is in the public domain. No attribution required.
}
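The code above calls a BytesEqual helper that the answer does not show. A minimal sketch of what it might look like, assuming it is meant as a prefix test, together with one way to turn the returned code-page identifier into text; both method names and the params-style signature are assumptions made for illustration:

```csharp
using System;
using System.Text;

public static class SniffHelpers
{
    // Assumed semantics: true when data starts with the given prefix bytes.
    // Arrays shorter than the prefix simply do not match.
    public static bool BytesEqual(byte[] data, params byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Decodes the bytes using a code-page identifier such as 65001 or 1200.
    public static string Decode(byte[] data, int codePage)
    {
        Encoding encoding = Encoding.GetEncoding(codePage);
        return encoding.GetString(data);
    }
}
```

Note that GetString does not strip a byte order mark, so a string decoded this way from data with a BOM starts with U+FEFF; skipping the signature bytes before decoding avoids that.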
Slumlord answered 23/2, 2016 at 19:19 Comment(0)
P
7

The first 2 or 3 bytes may be a Byte Order Mark (BOM) which can tell you whether the stream is UTF-8, Unicode-LittleEndian or Unicode-BigEndian.

  • UTF-8 BOM is 0xEF 0xBB 0xBF
  • Unicode-BigEndian is 0xFE 0xFF
  • Unicode-LittleEndian is 0xFF 0xFE

If none of these are present then you can use ASCII to test for <?xml (note that most modern XML generators stick to the standard, which says no white space may precede the XML declaration).

The declaration uses only ASCII characters up to ?>, so you can look for encoding= and read its value. If encoding isn't present, or there is no <?xml declaration at all, you can assume UTF-8.
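The ASCII scan described above could look roughly like this. It is a sketch under stated assumptions, not a full XML prolog parser: the class and method names are hypothetical, the 200-byte window is an arbitrary guess at how long a declaration can get, and the regular expression does not insist that the opening and closing quotes match.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class XmlDeclarationSniffer
{
    // Matches encoding="..." or encoding='...' and captures the name.
    private static readonly Regex EncodingAttribute =
        new Regex("encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Hypothetical helper: returns the declared encoding name, or "utf-8"
    // when there is no declaration or no encoding pseudo-attribute.
    public static string DeclaredEncodingName(byte[] bytes)
    {
        // Step over a UTF-8 byte order mark if one is present.
        int offset = bytes.Length >= 3 && bytes[0] == 0xEF
                     && bytes[1] == 0xBB && bytes[2] == 0xBF ? 3 : 0;

        // Only the start of the document matters; bytes above 0x7F become
        // '?' under ASCII decoding, which is harmless for this search.
        int count = Math.Min(bytes.Length - offset, 200);
        string head = Encoding.ASCII.GetString(bytes, offset, count);

        if (!head.StartsWith("<?xml", StringComparison.Ordinal)) return "utf-8";
        int end = head.IndexOf("?>", StringComparison.Ordinal);
        if (end < 0) return "utf-8";

        Match match = EncodingAttribute.Match(head.Substring(0, end));
        return match.Success ? match.Groups[1].Value : "utf-8";
    }
}
```

The returned name can be passed to Encoding.GetEncoding(name), which throws an ArgumentException for names it does not recognise. On .NET Core and later, code pages other than the built-in Unicode, ASCII and Latin-1 ones are only available after registering the provider from the System.Text.Encoding.CodePages package.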

Philbrick answered 24/2, 2009 at 11:8 Comment(1)
Any final solution with a full, working source code sample? - Demarcusdemaria
