How can I write out decoded HTML using HTMLAgilityPack?

I am having partial success in my attempt to write HTML to a DOCX file using HTMLAgilityPack and the DOCX library. However, the text I'm inserting into the .docx file contains encoded html such as:

La ciudad de Los &Aacute;ngeles (California) ha sincronizado su red completa de sem&aacute;foros &mdash;casi 4.500&mdash;, que cubre una zona de 1.215 kil&oacute;metros cuadrados (469 millas cuadradas). Seg&uacute;n el diario

What I want it to be is more like this:

La ciudad de Los Angeles (California) ha sincronizado su red completa de semaforos - casi 4.500 -, que cubre una zona de 1.215 kilometros cuadrados (469 millas cuadradas). Segun el diario

To show some context, this is the code I'm using:

private void ParseHTMLAndConvertBackToDOCX()
{
    List<string> sourceText = new List<string>();
    List<string> targetText = new List<string>();
    HtmlAgilityPack.HtmlDocument htmlDocSource = new HtmlAgilityPack.HtmlDocument();
    HtmlAgilityPack.HtmlDocument htmlDocTarget = new HtmlAgilityPack.HtmlDocument();

    // There are various options, set as needed
    htmlDocSource.OptionFixNestedTags = true;
    htmlDocTarget.OptionFixNestedTags = true;

    htmlDocSource.Load(sourceHTMLFilename);
    htmlDocTarget.Load(targetHTMLFilename);

    // Populate the generic list of strings with the source text lines
    if (htmlDocSource.DocumentNode != null)
    {
        IEnumerable<HtmlAgilityPack.HtmlNode> pNodes = htmlDocSource.DocumentNode.SelectNodes("//text()");

        foreach (HtmlNode sText in pNodes)
        {
            if (!string.IsNullOrWhiteSpace(sText.InnerText))
            {
                sourceText.Add(sText.InnerText);
            }
        }
    }

. . .

The most pertinent line is no doubt:

sourceText.Add(sText.InnerText);

Should it be something other than InnerText?

Is it possible to do something like:

sourceText.Add(sText.InnerText.Decode());

?

IntelliSense is not working for this, even though the project compiles and runs, so trying to see what options HtmlNode offers besides InnerText is fruitless; I know there's OuterText, InnerHTML, and OuterHTML, though...

Piddling asked 18/2, 2014 at 1:53

Try with:

sourceText.Add(HttpUtility.HtmlDecode(myEncodedString));

Convenient answered 18/2, 2014 at 2:11
Comment: Thanks; I just had to add a reference to System.Web. - Piddling
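
If adding a reference to System.Web is undesirable (for example, in a desktop project), the same kind of decoding is also available as System.Net.WebUtility.HtmlDecode in .NET Framework 4.0 and later. The following is only a minimal sketch: the class name and the sample string are hypothetical and are not taken from the question's files.

```csharp
using System;
using System.Net;

class HtmlDecodeSketch
{
    static void Main()
    {
        // Hypothetical sample input containing named HTML entities.
        string encoded = "sem&aacute;foros &mdash;casi 4.500&mdash;";

        // WebUtility.HtmlDecode turns entities back into the characters they stand for.
        string decoded = WebUtility.HtmlDecode(encoded);

        Console.WriteLine(decoded);
    }
}
```

In the question's loop, this would presumably mean passing sText.InnerText to the decode call before adding the result to sourceText.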

You can use HtmlEntity.DeEntitize(sText.InnerText) from HTMLAgilityPack.

Snoop answered 20/11, 2014 at 9:44
Comment: I prefer this answer because it needs no code other than HtmlAgilityPack. - Eggplant
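
As a rough sketch of how HtmlEntity.DeEntitize could fit into the loop from the question: the version below loads an inline HTML string instead of the question's files, so the markup and the class name are hypothetical. It also checks the result of SelectNodes for null, since HtmlAgilityPack returns null rather than an empty collection when nothing matches the XPath expression.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class DeEntitizeSketch
{
    static void Main()
    {
        // Hypothetical inline markup; the question loads its documents from files instead.
        var htmlDocSource = new HtmlDocument();
        htmlDocSource.LoadHtml("<p>Los &Aacute;ngeles</p><p>sem&aacute;foros</p>");

        var sourceText = new List<string>();

        // SelectNodes returns null when no node matches, so guard before iterating.
        HtmlNodeCollection textNodes = htmlDocSource.DocumentNode.SelectNodes("//text()");
        if (textNodes != null)
        {
            foreach (HtmlNode sText in textNodes)
            {
                if (!string.IsNullOrWhiteSpace(sText.InnerText))
                {
                    // Decode the entities before storing the line.
                    sourceText.Add(HtmlEntity.DeEntitize(sText.InnerText));
                }
            }
        }

        foreach (string line in sourceText)
        {
            Console.WriteLine(line);
        }
    }
}
```

The only change relative to the question's code is wrapping sText.InnerText in HtmlEntity.DeEntitize, plus the null guard.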