Fast way to deserialize XML with special characters
J

3

7

I am looking for a fast way to deserialize XML that contains special characters such as ö.

I was using XmlReader, and it fails to deserialize such characters.

Any suggestions?

EDIT: I am using C#. Code is as follows:

XElement element = ...; // holds the XML
XmlSerializer serializer = new XmlSerializer(typeof(MyType));
XmlReader reader = element.CreateReader();
object o = serializer.Deserialize(reader);
Jehoash answered 4/2, 2011 at 15:43 Comment(5)
What language/platform? What encoding are you using? Can you post your code? - Sills
Deserialization? Do you mean parsing? For what language/purpose is it? - Winze
In what context do the chars appear? Is it actually valid XML, or just an XML-alike? - Waste
It is valid XML; the problem appears when the XML contains German/Japanese characters. - Jehoash
XmlReader does handle all characters, but it may be an encoding issue. Could you post a full reproduction, and a full stack trace as well? - Monomania
H
8

I'd guess you're having an encoding issue, not in the XmlReader but with the XmlSerializer.
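
To illustrate what such an encoding issue can look like, here is a minimal sketch. It is not the asker's code (a full reproduction was never posted); it simply assumes the bytes were written as ISO-8859-1 and then read as UTF-8. The class and method names are made up for the example.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // Encodes text with one encoding and decodes the bytes with another,
    // mimicking a producer and a consumer that disagree about the encoding.
    public static string Reinterpret(string text, Encoding writtenAs, Encoding readAs)
    {
        byte[] bytes = writtenAs.GetBytes(text);
        return readAs.GetString(bytes);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Same encoding on both sides: the umlaut survives.
        Console.WriteLine(Reinterpret("ö", Encoding.UTF8, Encoding.UTF8) == "ö");

        // Mismatch: the single byte 0xF6 is not valid UTF-8, so the decoder
        // substitutes the replacement character U+FFFD.
        Console.WriteLine(Reinterpret("ö", latin1, Encoding.UTF8) == "ö");
    }
}
```

If the characters come out mangled like this, the fix is usually to make the reading side use the encoding the bytes were actually written in.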

You could use an XmlTextWriter with UTF-8 encoding together with the XmlSerializer, as in the following snippet (see the generic methods further down for a much nicer implementation). It works just fine with umlauts (äöü) and other special characters.

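using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

class Program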
{
    static void Main(string[] args)
    {
        SpecialCharacters specialCharacters = new SpecialCharacters { Umlaute = "äüö" };

        // serialize object to xml

        MemoryStream memoryStreamSerialize = new MemoryStream();
        XmlSerializer xmlSerializerSerialize = new XmlSerializer(typeof(SpecialCharacters));
        XmlTextWriter xmlTextWriterSerialize = new XmlTextWriter(memoryStreamSerialize, Encoding.UTF8);

        xmlSerializerSerialize.Serialize(xmlTextWriterSerialize, specialCharacters);
        memoryStreamSerialize = (MemoryStream)xmlTextWriterSerialize.BaseStream;

        // decodes the UTF-8 bytes in the stream into a string
        UTF8Encoding encodingSerialize = new UTF8Encoding();
        string serializedXml = encodingSerialize.GetString(memoryStreamSerialize.ToArray());

        xmlTextWriterSerialize.Close();
        memoryStreamSerialize.Close();
        memoryStreamSerialize.Dispose();

        // deserialize xml to object

        // converts a string to a UTF-8 byte array.
        UTF8Encoding encodingDeserialize = new UTF8Encoding();
        byte[] byteArray = encodingDeserialize.GetBytes(serializedXml);

        using (MemoryStream memoryStreamDeserialize = new MemoryStream(byteArray))
        {
            XmlSerializer xmlSerializerDeserialize = new XmlSerializer(typeof(SpecialCharacters));
            XmlTextWriter xmlTextWriterDeserialize = new XmlTextWriter(memoryStreamDeserialize, Encoding.UTF8);

            SpecialCharacters deserializedObject = (SpecialCharacters)xmlSerializerDeserialize.Deserialize(xmlTextWriterDeserialize.BaseStream);
        }
    }
}

[Serializable]
public class SpecialCharacters
{
    public string Umlaute { get; set; }
}

I personally use the following generic methods to serialize and deserialize XML and objects, and I haven't had any performance or encoding issues yet.

public static string SerializeObjectToXml<T>(T obj)
{
    MemoryStream memoryStream = new MemoryStream();
    XmlSerializer xmlSerializer = new XmlSerializer(typeof(T));
    XmlTextWriter xmlTextWriter = new XmlTextWriter(memoryStream, Encoding.UTF8);

    xmlSerializer.Serialize(xmlTextWriter, obj);
    memoryStream = (MemoryStream)xmlTextWriter.BaseStream;

    string xmlString = ByteArrayToStringUtf8(memoryStream.ToArray());

    xmlTextWriter.Close();
    memoryStream.Close();
    memoryStream.Dispose();

    return xmlString;
}

public static T DeserializeXmlToObject<T>(string xml)
{
    using (MemoryStream memoryStream = new MemoryStream(StringToByteArrayUtf8(xml)))
    {
        XmlSerializer xmlSerializer = new XmlSerializer(typeof(T));

        using (StreamReader xmlStreamReader = new StreamReader(memoryStream, Encoding.UTF8))
        {
            return (T)xmlSerializer.Deserialize(xmlStreamReader);
        }
    }
}

public static string ByteArrayToStringUtf8(byte[] value)
{
    UTF8Encoding encoding = new UTF8Encoding();
    return encoding.GetString(value);
}

public static byte[] StringToByteArrayUtf8(string value)
{
    UTF8Encoding encoding = new UTF8Encoding();
    return encoding.GetBytes(value);
}
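
Since the XML is already a .NET string at this point, the byte-array detour is arguably optional. The sketch below is a variation on the methods above, not a replacement for them: it redeclares the sample type as `Sample` (a hypothetical name, so the block is self-contained) and reads and writes through `StringReader` and `StringWriter`. The helper names `ToXml` and `FromXml` are also made up for the example.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical sample type, mirroring SpecialCharacters from the snippet above.
public class Sample
{
    public string Umlaute { get; set; }
}

public static class XmlStringHelper
{
    // Serializes an object to an XML string without going through bytes.
    public static string ToXml<T>(T obj)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(T));
        using (StringWriter writer = new StringWriter())
        {
            serializer.Serialize(writer, obj);
            return writer.ToString();
        }
    }

    // Deserializes directly from the characters of the string.
    public static T FromXml<T>(string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(T));
        using (StringReader reader = new StringReader(xml))
        {
            return (T)serializer.Deserialize(reader);
        }
    }
}

public static class RoundTripDemo
{
    public static void Main()
    {
        string xml = XmlStringHelper.ToXml(new Sample { Umlaute = "äüö" });
        Sample roundTripped = XmlStringHelper.FromXml<Sample>(xml);
        Console.WriteLine(roundTripped.Umlaute == "äüö");
    }
}
```

One thing to be aware of: with `StringWriter` the XML declaration says `utf-16`, which matters if the string is later written to a file or socket in a different encoding.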
Humpage answered 4/2, 2011 at 19:9 Comment(2)
Hmm. If you wrap a stream in an XmlTextWriter and then pass writer.BaseStream, it seems you could just pass the stream directly, without the XmlTextWriter. Especially since Deserialize wants an XmlReader, not a writer, if you are going to go that route. - Armyn
@JesseChisholm You are right, that makes total sense. I also found the implementation with StreamReader a couple of ticks faster. - Humpage
A
2

What works for me is similar to what @martin-buberl suggested:

public static T DeserializeXmlToObject<T>(string xml)
{
    using (MemoryStream memoryStream = new MemoryStream(Encoding.UTF8.GetBytes(xml)))
    {
        XmlSerializer xmlSerializer = new XmlSerializer(typeof(T));
        StreamReader reader = new StreamReader(memoryStream, Encoding.UTF8);
        return (T)xmlSerializer.Deserialize(reader);
    }
}
Armyn answered 16/8, 2012 at 22:41 Comment(0)
T
0

The simplest way of doing this is to convert the strings to Base64 before serializing them. Base64 turns any byte sequence (here, the UTF-8 bytes of the string) into printable ASCII characters, which removes the need to do "5000 conversions".

// "class" is a reserved word in C#, so the instance is named "payload" here
Serializable_Class payload = new Serializable_Class();

string xml_string1 = SOME_XML1;
string xml_string2 = SOME_XML2;
string xml_string3 = SOME_XML3;

payload.item1 = Convert.ToBase64String(Encoding.UTF8.GetBytes(xml_string1));
payload.item2 = Convert.ToBase64String(Encoding.UTF8.GetBytes(xml_string2));
payload.item3 = Convert.ToBase64String(Encoding.UTF8.GetBytes(xml_string3));

System.IO.MemoryStream payload_stream = new System.IO.MemoryStream();

System.Xml.Serialization.XmlSerializer payload_generator = new System.Xml.Serialization.XmlSerializer(payload.GetType());

payload_generator.Serialize(payload_stream, payload);

byte[] serialised_class = payload_stream.ToArray();

payload_stream.Close();
payload_stream.Dispose();

The string members of the class that will be serialised must first be converted to Base64 strings. A MemoryStream is then instantiated to hold the output of the serialisation in memory, and an XmlSerializer is created for the object's type. The MemoryStream is passed to the Serialize() method together with the object, so that the XmlSerializer writes into it. After the serialisation is finished, calling ToArray() on the MemoryStream returns the serialised object as a byte array. On the receiving side, each member has to be decoded again with Convert.FromBase64String() and Encoding.UTF8.GetString().
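
A minimal sketch of the Base64 round trip for a single string member, including the decoding step that the code above does not show. The `Base64Field` class and its method names are made up for this example.

```csharp
using System;
using System.Text;

public static class Base64Field
{
    // Text -> UTF-8 bytes -> Base64 (an ASCII-only string).
    public static string Encode(string text)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
    }

    // Base64 -> UTF-8 bytes -> original text.
    public static string Decode(string base64)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(base64));
    }

    public static void Main()
    {
        string encoded = Encode("ö");
        Console.WriteLine(encoded);
        Console.WriteLine(Decode(encoded) == "ö");
    }
}
```

The trade-off is that Base64 makes the payload roughly a third larger and the XML is no longer human-readable, so both sides have to agree that these members are Base64-encoded.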

Type answered 23/2, 2023 at 19:11 Comment(0)
