Unescaping XML entities using XmlReader in .NET?

A

5

11

I'm trying to unescape XML entities in a string in .NET (C#), but I don't seem to get it to work correctly.

For example, if I have the string AT&amp;T, it should be translated to AT&T.

One way is to use HttpUtility.HtmlDecode(), but that's for HTML.

So I have two questions about this:

1. Is it safe to use HttpUtility.HtmlDecode() for decoding XML entities?

2. How do I use XmlReader (or something similar) to do this? I have tried the following, but it always returns an empty string:

static string ReplaceEscapes(string text)
{
    StringReader reader = new StringReader(text);

    XmlReaderSettings settings = new XmlReaderSettings();

    settings.ConformanceLevel = ConformanceLevel.Fragment;

    using (XmlReader xmlReader = XmlReader.Create(reader, settings))
    {
        return xmlReader.ReadString();
    }
}

Angelikaangelina answered 14/3, 2011 at 20:47 Comment(0)

G

8

Your #2 solution can work, but you need to call xmlReader.Read(); (or xmlReader.MoveToContent();) prior to ReadString.

I guess #1 would also be acceptable, even though there are edge cases like &reg;, which is a valid HTML entity but not an XML entity. What should your unescaper do with it? Throw an exception, as a proper XML parser would, or just return “®”, as an HTML parser would?
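
A minimal sketch of the fix described above, assuming the same Fragment settings as in the question. The class name XmlUnescapeDemo and the sample inputs are illustrative only. It also shows what a strict XML reader does with an entity such as &reg;, which XML does not predefine.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlUnescapeDemo   // hypothetical name, for illustration only
{
    public static string ReplaceEscapes(string text)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Fragment;

        using (StringReader reader = new StringReader(text))
        using (XmlReader xmlReader = XmlReader.Create(reader, settings))
        {
            // Advance from the initial state to the first content node.
            // Without this call the reader has no current node to read from.
            xmlReader.MoveToContent();
            return xmlReader.ReadString();
        }
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceEscapes("AT&amp;T"));   // prints "AT&T"

        try
        {
            ReplaceEscapes("&reg;");
        }
        catch (XmlException ex)
        {
            // &reg; is not one of XML's predefined entities, so the reader
            // rejects it instead of returning "®".
            Console.WriteLine("Rejected: " + ex.Message);
        }
    }
}
```

Whether the exception is desirable depends on whether the input is meant to be strict XML or may contain HTML-style entities.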

Greek answered 14/3, 2011 at 21:31 Comment(1)

Adding xmlReader.MoveToContent() did the trick, and that's exactly the solution I was looking for. I didn't really want to use HttpUtility because of the differences between HTML and XML, so your response was extremely helpful. – Angelikaangelina 14/3, 2011 at 21:42

S

17

HTML escaping and XML escaping are closely related. As you have said, HttpUtility has both HtmlEncode and HtmlDecode methods. These will also work on XML, as there are only a few characters that need escaping in both HTML and XML: <, >, ", ' and &.

The downside of using the HttpUtility class is that you need a reference to the System.Web dll, which also brings in a lot of other stuff that you probably don't want.
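
If the System.Web reference is the only objection, one alternative not mentioned in this answer is System.Net.WebUtility, which has offered HtmlDecode since .NET Framework 4 without needing System.Web. A minimal sketch, with a hypothetical class name; note that it is still an HTML decoder, so it also accepts HTML-only entities that are not valid in XML.

```csharp
using System;
using System.Net;

public static class WebUtilityDemo   // hypothetical name, for illustration only
{
    public static string Decode(string text)
    {
        // WebUtility.HtmlDecode does not require a reference to System.Web.
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        Console.WriteLine(Decode("AT&amp;T"));   // prints "AT&T"
    }
}
```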

Specifically for XML, the SecurityElement class has an Escape method that will do the encoding, but does not have a corresponding Unescape method. You therefore have a few options:

- Use HttpUtility.HtmlDecode() and put up with a reference to System.Web.
- Roll your own decode method that takes care of the special characters (there are only a handful; look at the static constructor of SecurityElement in Reflector to see the full list).
- Use a (hacky) solution like:

    public static string Unescape(string text)
    {
        XmlDocument doc = new XmlDocument();
        string xml = string.Format("<dummy>{0}</dummy>", text);
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }
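
The "roll your own" option from the list above could look roughly like the sketch below. It assumes only the five entities predefined by XML need decoding; the class name is hypothetical, and numeric character references such as &#65; are deliberately not handled. SecurityElement.Escape is used only to produce test input.

```csharp
using System;
using System.Security;

public static class XmlEntityDemo   // hypothetical name, for illustration only
{
    // Decodes only the five entities predefined by XML.
    // Numeric references such as &#65; are NOT handled in this sketch.
    public static string Unescape(string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;

        return text
            .Replace("&lt;", "<")
            .Replace("&gt;", ">")
            .Replace("&quot;", "\"")
            .Replace("&apos;", "'")
            // &amp; must be replaced last; otherwise "&amp;lt;" would be
            // decoded twice and come out as "<" instead of "&lt;".
            .Replace("&amp;", "&");
    }

    public static void Main()
    {
        string escaped = SecurityElement.Escape("AT&T <\"best\">");
        Console.WriteLine(escaped);
        Console.WriteLine(Unescape(escaped));   // prints: AT&T <"best">
    }
}
```

Chained Replace calls scan the string several times; for large inputs a single-pass decoder may be preferable.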

Personally, I would use HttpUtility.HtmlDecode() if I already had a reference to System.Web, or roll my own if not. I don't like your XmlReader approach because XmlReader is disposable, which usually indicates that it holds resources that need to be released, so it may be a costly operation.

Socher answered 14/3, 2011 at 21:31 Comment(0)

T

1

This works:

using (XmlReader xmlReader = XmlReader.Create(reader, settings))
{
    if (xmlReader.Read())
    {
       return xmlReader.ReadString();
    }
}

Throughout answered 14/3, 2011 at 21:41 Comment(0)

Q

1

I found that the top answer has a small bug if your input text ends with certain white space characters, like carriage returns.

The string "Testing
" loses its trailing white space.

If you combine the solution in the question with adrianbanks' wrapper tag you get the following, which works.

public static string UnescapeUnicode(string line)
{
    using (StringReader reader = new StringReader("<a>" + line + "</a>"))
    {
        using (XmlReader xmlReader = XmlReader.Create(reader))
        {
            xmlReader.MoveToContent();
            return xmlReader.ReadElementContentAsString();
        }
    }
}

Quadriplegia answered 25/5, 2012 at 15:23 Comment(0)

U

1

This works as well, and needs the least code:

    public static string DecodeString(string encodedString)
    {
        if (string.IsNullOrEmpty(encodedString))
            return string.Empty;
        XmlTextReader xtr = new XmlTextReader(encodedString, XmlNodeType.Element, null);
        if (xtr.Read())
            return xtr.ReadString();
        throw new Exception("Error decoding xml string : " + encodedString);
    }

Update1: hmm, it seems this does not work if encodedString is "", because xtr.Read() then returns false.

Update2: added workaround

Update3: this seems to work even better

    public static string DecodeString(string encodedString)
    {
        XmlTextReader xtr = new XmlTextReader(encodedString, XmlNodeType.Element, null);
        xtr.MoveToContent();
        return xtr.Value;
    }

Unheardof answered 10/3, 2016 at 14:23 Comment(0)
