Grab all text from html with Html Agility Pack

Input

<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>

Output

foo
bar
baz

I know about htmldoc.DocumentNode.InnerText, but it returns all of the text as a single string (foobarbaz). I want to get each piece of text separately, not all at once.

Desecrate answered 15/11, 2010 at 8:25 Comment(0)
S
13
var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
    if (!node.HasChildNodes)
    {
        // Trim first so whitespace-only nodes do not produce blank lines
        string text = node.InnerText.Trim();
        if (text.Length > 0)
            sb.AppendLine(text);
    }
}

This does what you need, but I am not sure if this is the best way. Maybe you should iterate through something other than DescendantNodesAndSelf for optimal performance.

Stonyhearted answered 15/11, 2010 at 9:15 Comment(0)
I
76

XPath is your friend :)

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");

foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    Console.WriteLine("text=" + node.InnerText);
}
Impenitent answered 21/11, 2010 at 9:50 Comment(3)
This worked brilliantly for me. Everything I threw at it, even crappy html fragments generated by an old CMS. – Elan
Nice. Below is a small modification that will also handle the scenario where there is no text (thus avoiding a run-time exception). HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//text()"); if (textNodes != null) foreach (HtmlNode node in textNodes) result += node.InnerText; – Mixologist
@Mixologist Just what I needed – Flathead
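
A minimal sketch that combines the XPath query with the null check from the comment above and returns each text node as its own string, which is what the question asks for. The class and method names (TextNodeExtractor, GetTextNodes) are hypothetical helpers, not part of Html Agility Pack, and skipping whitespace-only nodes is an assumption about the desired output.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TextNodeExtractor
{
    // Hypothetical helper: returns every non-blank text node as a trimmed string.
    public static List<string> GetTextNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes can return null (rather than an empty collection)
        // when the XPath expression matches nothing.
        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
            return result;

        foreach (HtmlNode node in textNodes)
        {
            string text = node.InnerText.Trim();
            if (text.Length > 0)
                result.Add(text);
        }
        return result;
    }
}
```

Called with the HTML from the question, this should yield the three entries foo, bar and baz; an input without any text yields an empty list instead of throwing.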
H
13

I needed a solution that extracts all text but discards the content of script and style tags. I could not find one anywhere, so I came up with the following, which suits my own needs:

// Requires: using System.Linq; using HtmlAgilityPack;
// Keep only text nodes whose direct parent is neither <script> nor <style>
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where(n =>
    n.NodeType == HtmlNodeType.Text &&
    n.ParentNode.Name != "script" &&
    n.ParentNode.Name != "style");
foreach (HtmlNode node in nodes) {
    Console.WriteLine(node.InnerText);
}
Helladic answered 24/9, 2014 at 16:20 Comment(1)
Love this solution, it also strips the CSS and scripts :-) – Skimmer
A
11
var pageContent = "{html content goes here}";
var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent);
var pageText = pageDoc.DocumentNode.InnerText;

The example HTML from the question:

<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>

will produce the following output:

foo bar baz
Aldin answered 12/12, 2014 at 10:29 Comment(1)
This will also make any CSS part of pageText, which in my case is not desired. – Acroterion
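
One possible way around the CSS problem raised in the comment is to remove script and style elements from the parsed document before reading InnerText. This is only a sketch under that assumption: VisibleTextExtractor and GetVisibleText are hypothetical names, while Descendants, Remove and HtmlEntity.DeEntitize are existing Html Agility Pack members.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class VisibleTextExtractor
{
    // Hypothetical helper: InnerText of the document without script/style content.
    public static string GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Materialize the list first; removing nodes while enumerating
        // Descendants() would modify the tree during iteration.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();
        foreach (HtmlNode node in unwanted)
            node.Remove();

        // Decode entities such as &amp; into plain characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

The result still contains whatever whitespace the markup had between elements, so it usually needs some cleanup afterwards.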
P
5
public string html2text(string html) {
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(@"<html><body>" + html + "</body></html>");
    return doc.DocumentNode.SelectSingleNode("//body").InnerText;
}

This workaround is based on Html Agility Pack. You can also install it via NuGet (package name: HtmlAgilityPack).

Pouch answered 11/11, 2015 at 16:29 Comment(1)
If your html parameter had a <b> tag, it would convert the html to a line break (\n) when converting it to text, which is not correct. – Suh
S
0

https://github.com/jamietre/CsQuery

Have you tried CsQuery? Although it is no longer actively maintained, it's still my favorite for converting HTML to text. Here's a one-liner that shows how simple it is to get the text from HTML:

var text = CQ.CreateDocument(htmlText).Text();

Here's a complete console application:

using System;
using CsQuery;

public class Program
{
    public static void Main()
    {
        var html = "<div><h1>Hello World <p> some text inside h1 tag under p tag </p> </h1></div>";
        var text = CQ.CreateDocument(html).Text();
        Console.WriteLine(text); // Output: Hello World  some text inside h1 tag under p tag

    }
}

I understand that the OP asked specifically about HtmlAgilityPack, but CsQuery is a lesser-known library and one of the best solutions I've found, so I wanted to share it in case someone finds it helpful. Cheers!

Somnambulation answered 23/10, 2020 at 10:28 Comment(0)
A
0

I combined and fixed some of the other answers so that they work better:

// result holds the HTML string to convert
var document = new HtmlDocument();
document.LoadHtml(result);
var sb = new StringBuilder();
foreach (var node in document.DocumentNode.DescendantsAndSelf())
{
    // Only leaf text nodes that are not inside <script> or <style>
    if (!node.HasChildNodes && node.Name == "#text" && node.ParentNode.Name != "script" && node.ParentNode.Name != "style")
    {
        string text = node.InnerText?.Trim();
        // Skip empty text and leftover markup fragments
        if (!string.IsNullOrEmpty(text) && !text.StartsWith('<') && !text.EndsWith('>'))
            sb.AppendLine(System.Web.HttpUtility.HtmlDecode(text));
    }
}
Abusive answered 3/11, 2022 at 12:16 Comment(0)
B
0

Possibly something like the below (I found the very basic version while googling and extended it to handle hyperlinks, ul, ol, divs, tables)

    /// <summary>
    /// Static class that provides functions to convert HTML to plain text.
    /// </summary>
    public static class HtmlToText {

        #region Method: ConvertFromFile (public - static)
        /// <summary>
        /// Converts the HTML content from a given file path to plain text.
        /// </summary>
        /// <param name="path">The path to the HTML file.</param>
        /// <returns>The plain text version of the HTML content.</returns>
        public static string ConvertFromFile(string path) {
            var doc = new HtmlDocument();

            // Load the HTML file
            doc.Load(path);

            using (var sw = new StringWriter()) {
                // Convert the HTML document to plain text
                ConvertTo(node: doc.DocumentNode,
                          outText: sw,
                          counters: new Dictionary<HtmlNode, int>());
                sw.Flush();
                return sw.ToString();
            }
        }
        #endregion

        #region Method: ConvertFromString (public - static)
        /// <summary>
        /// Converts the given HTML string to plain text.
        /// </summary>
        /// <param name="html">The HTML content as a string.</param>
        /// <returns>The plain text version of the HTML content.</returns>
        public static string ConvertFromString(string html) {
            var doc = new HtmlDocument();

            // Load the HTML string
            doc.LoadHtml(html);

            using (var sw = new StringWriter()) {
                // Convert the HTML string to plain text
                ConvertTo(node: doc.DocumentNode,
                          outText: sw,
                          counters: new Dictionary<HtmlNode, int>());
                sw.Flush();
                return sw.ToString();
            }
        }
        #endregion

        #region Method: ConvertContentTo (private - static)
        /// <summary>
        /// Helper method to convert each child node of the given node to text.
        /// </summary>
        /// <param name="node">The HTML node to convert.</param>
        /// <param name="outText">The writer to output the text to.</param>
        /// <param name="counters">Keep track of the ol/li counters during conversion</param>
        private static void ConvertContentTo(HtmlNode node, TextWriter outText, Dictionary<HtmlNode, int> counters) {
            // Convert each child node to text
            foreach (var subnode in node.ChildNodes) {
                ConvertTo(subnode, outText, counters);
            }
        }
        #endregion

        #region Method: ConvertTo (public - static)
        /// <summary>
        /// Converts the given HTML node to plain text.
        /// </summary>
        /// <param name="node">The HTML node to convert.</param>
        /// <param name="outText">The writer to output the text to.</param>
        /// <param name="counters">Keep track of the ol/li counters during conversion</param>
        public static void ConvertTo(HtmlNode node, TextWriter outText, Dictionary<HtmlNode, int> counters) {
            string html;

            switch (node.NodeType) {
                case HtmlNodeType.Comment:
                    // Don't output comments
                    break;
                case HtmlNodeType.Document:
                    // Convert entire content of document node to text
                    ConvertContentTo(node, outText, counters);
                    break;
                case HtmlNodeType.Text:
                    // Ignore script and style nodes
                    var parentName = node.ParentNode.Name;
                    if ((parentName == "script") || (parentName == "style")) {
                        break;
                    }

                    // Get text from the text node
                    html = ((HtmlTextNode)node).Text;

                    // Ignore special closing nodes output as text
                    if (HtmlNode.IsOverlappedClosingElement(html) || string.IsNullOrWhiteSpace(html)) {
                        break;
                    }

                    // Write meaningful text (not just white-spaces) to the output
                    outText.Write(HtmlEntity.DeEntitize(html));
                    break;
                case HtmlNodeType.Element:
                    switch (node.Name.ToLowerInvariant()) {
                        case "p":
                        case "div":
                        case "br":
                        case "table":
                            // Start a new line for paragraphs, divs, line breaks and tables
                            outText.Write("\n");
                            break;
                        case "li":
                            // Number the items of ordered lists; prefix other list items with a dash
                            if (node.ParentNode.Name == "ol") {
                                if (!counters.ContainsKey(node.ParentNode)) {
                                    counters[node.ParentNode] = 0;
                                }
                                counters[node.ParentNode]++;
                                outText.Write("\n" + counters[node.ParentNode] + ". ");
                            } else {
                                outText.Write("\n- ");
                            }
                            break;
                        case "a":
                            // convert hyperlinks to include the URL in parenthesis
                            if (node.HasChildNodes) {
                                ConvertContentTo(node, outText, counters);
                            }
                            if (node.Attributes["href"] != null) {
                                outText.Write($" ({node.Attributes["href"].Value})");
                            }
                            break;
                        case "th":
                        case "td":
                            outText.Write(" | ");
                            break;
                    }

                    // Convert child nodes to text if they exist (ignore a href children as they are already handled)
                    if (node.Name.ToLowerInvariant() != "a" && node.HasChildNodes) {
                        ConvertContentTo(node: node,
                                         outText: outText,
                                         counters: counters);
                    }
                    break;
            }
        }
        #endregion

    } // class: HtmlToText 
Barny answered 21/6, 2023 at 8:9 Comment(0)
E
0
string Body = htmlDocument.DocumentNode.SelectSingleNode("//body").InnerText;

Then you need to clean up the text, for example by removing excessive whitespace.
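
A rough sketch of what such a cleanup step could look like, using only System.Text.RegularExpressions. WhitespaceCleaner and Normalize are hypothetical names, and the exact rules (collapse runs of spaces and tabs, drop blank lines, trim the ends) are an assumption about what "excessive whitespace" means here.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceCleaner
{
    // Hypothetical helper: collapses whitespace left over after reading InnerText.
    public static string Normalize(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        // Collapse runs of spaces/tabs (any whitespace except line breaks) into one space.
        string cleaned = Regex.Replace(text, @"[^\S\r\n]+", " ");

        // Drop spaces around line breaks and normalize \r\n to \n.
        cleaned = Regex.Replace(cleaned, @" ?\r?\n ?", "\n");

        // Collapse consecutive line breaks so no blank lines remain.
        cleaned = Regex.Replace(cleaned, @"\n{2,}", "\n");

        return cleaned.Trim();
    }
}
```

It would be applied to the extracted string, for instance WhitespaceCleaner.Normalize(Body).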

Excommunicative answered 23/2 at 0:52 Comment(0)
