Ignoring specified encoding when deserializing XML

I am trying to read some XML received from an external interface over a socket. The problem is that the encoding is specified incorrectly in the XML-header (it says iso-8859-1, but the data is actually utf-16BE). The interface documentation states that the encoding is utf-16BE, but apparently they forgot to set the correct encoding in the header.

To ignore the encoding when I deserialize I use a StringReader like this:

    private static T DeserializeXmlData<T>(byte[] xmlData)
    {
        var xmlString = Encoding.BigEndianUnicode.GetString(xmlData);
        using (var reader = new StringReader(xmlString))
        {
            reader.ReadLine(); // Eat header line
            using (var xmlReader = XmlReader.Create(reader))
            {
                var serializer = new XmlSerializer(typeof(T));
                return (T)serializer.Deserialize(xmlReader);
            }
        }
    }

The above actually works fine, but I don't like the part where I just skip the header line by calling ReadLine. Is there a less brittle way to bypass the encoding specified in the XML-header?

Solution with StreamReader

By using a StreamReader, I can override the encoding specified in the XML-header. Setting XmlReaderSettings.IgnoreProcessingInstructions made no difference either way. Interestingly, the StreamReader ignores the encoding passed to its constructor if it finds a Unicode byte-order mark.

To recap:

  • If the XmlReader is initialized with a TextReader, the encoding in the XML-header is ignored.
  • If a StringReader is used, the XmlReader fails if a Unicode byte-order mark is present.
  • If a StreamReader is used, a Unicode byte-order mark overrides the encoding passed to the StreamReader.
  • XmlReaderSettings.IgnoreProcessingInstructions = true makes no difference when using a TextReader.
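
The byte-order-mark point in the recap can be checked with a small sketch. It assumes the StreamReader(Stream, Encoding) constructor, which enables byte-order-mark detection by default; the class and method names (BomDemo, DetectedEncoding) are made up for illustration. It asks for utf-16BE in both cases and then reports which encoding the reader actually used.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

static class BomDemo
{
    // Reads the bytes with a StreamReader that is asked to use utf-16BE and
    // returns the name of the encoding the reader ended up using.
    public static string DetectedEncoding(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.BigEndianUnicode))
        {
            reader.ReadToEnd(); // CurrentEncoding is only meaningful after the first read
            return reader.CurrentEncoding.WebName;
        }
    }

    static void Main()
    {
        // UTF-8 payload preceded by a UTF-8 byte-order mark (EF BB BF).
        byte[] utf8Bom = { 0xEF, 0xBB, 0xBF };
        byte[] withBom = utf8Bom.Concat(Encoding.UTF8.GetBytes("<root/>")).ToArray();

        // utf-16BE payload without any byte-order mark (GetBytes does not emit one).
        byte[] withoutBom = Encoding.BigEndianUnicode.GetBytes("<root/>");

        Console.WriteLine(DetectedEncoding(withBom));
        Console.WriteLine(DetectedEncoding(withoutBom));
    }
}
```

If the first line printed differs from the second, the byte-order mark has taken precedence over the constructor argument, which is the behavior described above.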

In conclusion, the most robust solution seems to be using a StreamReader, since it uses the byte-order mark, if present.

    private static T DeserializeXmlData<T>(byte[] xmlData)
    {
        // The StreamReader decodes the bytes as utf-16BE (or per the byte-order mark, if present),
        // so the XmlReader never consults the encoding named in the XML-header.
        using (var xmlDataStream = new MemoryStream(xmlData))
        using (var reader = new StreamReader(xmlDataStream, Encoding.BigEndianUnicode))
        using (var xmlReader = XmlReader.Create(reader))
        {
            var serializer = new XmlSerializer(typeof(T));
            return (T)serializer.Deserialize(xmlReader);
        }
    }
Danica answered 27/10, 2010 at 14:14 Comment(0)
A
4

I think I'd just use a StreamReader constructed with the right encoding, and pass that to the XmlReader.Create(TextReader) method:

 using (var sr = new StreamReader(@"c:\temp\bad.xml", Encoding.BigEndianUnicode)) {
     using (var xr = XmlReader.Create(sr, new XmlReaderSettings())) {
         // etc...
     }
 }
Aromaticity answered 27/10, 2010 at 15:30 Comment(0)
A
1

If there are no other relevant processing instructions, you can just ignore them by setting XmlReaderSettings.IgnoreProcessingInstructions to true.
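
For what it's worth, here is a rough sketch of what that setting does to the node stream. It assumes that XmlReader reports the `<?xml ...?>` header as an XmlDeclaration node rather than as a ProcessingInstruction node; the class name, method name, and the `<?demo data?>` instruction are invented for the example. It lists the node types the reader yields with the setting switched on.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class DeclarationDemo
{
    // Returns the node types an XmlReader yields for the given XML text
    // when IgnoreProcessingInstructions is enabled.
    public static List<XmlNodeType> NodeTypes(string xml)
    {
        var settings = new XmlReaderSettings { IgnoreProcessingInstructions = true };
        var types = new List<XmlNodeType>();
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                types.Add(reader.NodeType);
            }
        }
        return types;
    }

    static void Main()
    {
        // An XML header with a (wrong) encoding, one ordinary processing
        // instruction, and an empty root element.
        var xml = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><?demo data?><root/>";
        foreach (var type in NodeTypes(xml))
        {
            Console.WriteLine(type);
        }
    }
}
```

If the header still shows up in the output while the `<?demo data?>` instruction does not, the setting only filters ordinary processing instructions and has no say over the header or its encoding.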

Arabella answered 27/10, 2010 at 14:19 Comment(1)
Great! How would I specify the "true" encoding then? (An XmlReader based on a StringReader throws an exception even with IgnoreProcessingInstructions set to true.) - Danica
