Converting HTML entities to Unicode Characters in C#
6

46

I found similar questions and answers for Python and JavaScript, but not for C# or any other WinRT-compatible language.

The reason I think I need it is that I'm displaying text I get from websites in a Windows 8 Store app. E.g. &eacute; should become é.

Or is there a better way? I'm not displaying websites or RSS feeds, just a list of websites and their titles.

Calaboose answered 21/11, 2012 at 11:40 Comment(3)
Duplicate: #5784317Latrena
Actually it's not. He had a different issue.Calaboose
It is indeed a duplicate. That question just had an extra step at the end that you don't need.Cacus
83

I recommend using System.Net.WebUtility.HtmlDecode and NOT HttpUtility.HtmlDecode.

This is because the System.Web assembly is not referenced by default in WinForms/WPF/console applications, and you can get the exact same result using this class (which is already available in all those projects).

Usage:

string s = System.Net.WebUtility.HtmlDecode("&eacute;"); // Returns é
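
For illustration, here is a slightly fuller sketch of the same call as a console program. The class name HtmlDecodeDemo and the sample strings are made up for this example. On desktop .NET the named, decimal and hexadecimal forms are expected to decode to the same character; as the comments in this thread point out, other platforms such as WP8 may behave differently.

```csharp
using System;
using System.Net;

// Hypothetical demo class, not part of the original answer.
public static class HtmlDecodeDemo
{
    public static void Main()
    {
        // Named, decimal and hexadecimal references for the same character (é).
        string named = WebUtility.HtmlDecode("caf&eacute;");
        string dec = WebUtility.HtmlDecode("caf&#233;");
        string hex = WebUtility.HtmlDecode("caf&#xE9;");

        Console.WriteLine(named);
        Console.WriteLine(dec);
        Console.WriteLine(hex);

        // Text without entities passes through unchanged.
        Console.WriteLine(WebUtility.HtmlDecode("plain text"));
    }
}
```

No reference to System.Web is needed here, since WebUtility lives in the System.Net namespace.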
Gassman answered 21/11, 2012 at 11:57 Comment(4)
"you can get the exact same result using this class" - INCORRECT. Only the HttpUtility implementation will correctly decode ' as an apostrophe on WP8.Serotonin
In my case, HttpUtility.HtmlDecode did the right thing.Marabout
Good solution, but the disadvantage of System.Net.WebUtility.HtmlDecode is that you won't find it in .NET Framework 3.5 if you are coding for old Windows 7 systems.Wakerly
link is broken.Wrong
13

Use HttpUtility.HtmlDecode(). Read about it on MSDN here.

decodedString = HttpUtility.HtmlDecode(myEncodedString);
Retinite answered 21/11, 2012 at 11:43 Comment(3)
Yep, note that for a WinForms or console application you first have to add a reference to the System.Web assembly.Referendum
Hi, I tried this solution but it doesn't decode characters like { :(Intoxicated
@l19 Is that a recognized HTML entity? I can't find it in this list. I did manage to find it in a developmental W3C spec, though. That's probably why it isn't decoded yet.Leralerch
11

This might be useful. It replaces all named entities (as far as my requirements go) with their numeric Unicode equivalent, e.g. &eacute; becomes &#233;.

    public string EntityToUnicode(string html) {
        var replacements = new Dictionary<string, string>();
        var regex = new Regex("(&[a-z]{2,5};)");
        foreach (Match match in regex.Matches(html)) {
            if (!replacements.ContainsKey(match.Value)) { 
                var unicode = HttpUtility.HtmlDecode(match.Value);
                if (unicode.Length == 1) {
                    replacements.Add(match.Value, string.Concat("&#", Convert.ToInt32(unicode[0]), ";"));
                }
            }
        }
        foreach (var replacement in replacements) {
            html = html.Replace(replacement.Key, replacement.Value);
        }
        return html;
    }
Burundi answered 1/7, 2014 at 16:34 Comment(2)
Works for my case, but I changed the regex to var regex = new Regex("(&[a-z]{2,6};)"); because a lot of HTML entities are longer than 5 characters (like &eacute;).Wedurn
I'd also suggest changing the regex to var regex = new Regex("(&[a-zA-Z]{2,7};)"); so that characters such as &Atilde; are included.Elana
3

HTML entities and HTML numbers are encoded and decoded differently in a Metro app and a WP8 app.

With Windows Runtime Metro App

{
    string inStr = "ó";
    string auxStr = System.Net.WebUtility.HtmlEncode(inStr);
    // auxStr == &#243;
    string outStr = System.Net.WebUtility.HtmlDecode(auxStr);
    // outStr == ó
    string outStr2 = System.Net.WebUtility.HtmlDecode("&oacute;");
    // outStr2 == ó
}

With Windows Phone 8.0

{
    string inStr = "ó";
    string auxStr = System.Net.WebUtility.HtmlEncode(inStr);
    // auxStr == &#243;
    string outStr = System.Net.WebUtility.HtmlDecode(auxStr);
    // outStr == &#243;
    string outStr2 = System.Net.WebUtility.HtmlDecode("&oacute;");
    // outStr2 == ó
}

To solve this on WP8, I implemented the table from the HTML ISO-8859-1 Reference and apply it before calling System.Net.WebUtility.HtmlDecode().
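
The table itself is not shown in this answer. As a rough sketch of a related workaround (not the answer's actual code), numeric references can be converted by hand in a single pass while named entities are still handed to WebUtility. The class EntityDecoder and the method Decode are hypothetical names, and this has not been verified on WP8 itself.

```csharp
using System;
using System.Globalization;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class EntityDecoder
{
    // Matches decimal (&#243;), hexadecimal (&#xF3;) and named (&oacute;) references.
    private static readonly Regex Entity = new Regex(
        "&(?:#(?<dec>[0-9]{1,7})|#[xX](?<hex>[0-9a-fA-F]{1,6})|(?<name>[a-zA-Z][a-zA-Z0-9]{1,31}));");

    public static string Decode(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;

        return Entity.Replace(html, m =>
        {
            // Named entities are left to the framework.
            if (m.Groups["name"].Success) return WebUtility.HtmlDecode(m.Value);

            int codePoint = m.Groups["hex"].Success
                ? int.Parse(m.Groups["hex"].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture)
                : int.Parse(m.Groups["dec"].Value, CultureInfo.InvariantCulture);

            // Leave out-of-range values and surrogate code points untouched.
            bool valid = codePoint > 0 && codePoint <= 0x10FFFF
                         && (codePoint < 0xD800 || codePoint > 0xDFFF);
            return valid ? char.ConvertFromUtf32(codePoint) : m.Value;
        });
    }
}
```

Each reference is decoded exactly once, so input such as &#38;lt; comes out as the text &lt; rather than being decoded a second time into <.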

Negotiable answered 5/2, 2013 at 9:15 Comment(1)
The link is dead.Pintail
2

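This worked for me. It replaces both named entities and numeric (Unicode code point) entities.

private static readonly Regex HtmlEntityRegex = new Regex("&(#)?([a-zA-Z0-9]+);");

// Extension method: must be declared inside a static class.
public static string HtmlDecode(this string html)
{
    // null and "" pass through unchanged.
    if (string.IsNullOrEmpty(html)) return html;
    // Decimal references (&#39;) are converted directly; everything else,
    // including hexadecimal references (&#x27;), is left to HttpUtility.
    return HtmlEntityRegex.Replace(html, x =>
        x.Groups[1].Value == "#" && int.TryParse(x.Groups[2].Value, out var code) && code <= char.MaxValue
            ? ((char)code).ToString()
            : HttpUtility.HtmlDecode(x.Groups[0].Value));
}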

[Test]
[TestCase(null, null)]
[TestCase("", "")]
[TestCase("&#39;fark&#39;", "'fark'")]
[TestCase("&quot;fark&quot;", "\"fark\"")]
public void should_remove_html_entities(string html, string expected)
{
    html.HtmlDecode().ShouldEqual(expected);
}
Alboin answered 29/9, 2016 at 18:53 Comment(0)
1

An improved version of Zumey's method (I can't comment there). The maximum length is that of the entity &exclamation; (11). Upper-case letters are also possible in entities, e.g. &Agrave; (source: wiki).

    public string EntityToUnicode(string html) {
        var replacements = new Dictionary<string, string>();
        var regex = new Regex("(&[a-zA-Z]{2,11};)");
        foreach (Match match in regex.Matches(html)) {
            if (!replacements.ContainsKey(match.Value)) { 
                var unicode = HttpUtility.HtmlDecode(match.Value);
                if (unicode.Length == 1) {
                    replacements.Add(match.Value, string.Concat("&#", Convert.ToInt32(unicode[0]), ";"));
                }
            }
        }
        foreach (var replacement in replacements) {
            html = html.Replace(replacement.Key, replacement.Value);
        }
        return html;
    }
Firebird answered 25/9, 2018 at 13:39 Comment(0)
