Input
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
Output
foo
bar
baz
I know of htmldoc.DocumentNode.InnerText, but it returns foobarbaz. I want to get each piece of text separately, not all at once.
var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim());
}
}
This does what you need, but I am not sure it is the best way. Iterating over something narrower than DescendantNodesAndSelf may perform better.
XPath is your friend :)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
Console.WriteLine("text=" + node.InnerText);
}
I needed a solution that extracts all text but discards the content of script and style tags. I could not find one anywhere, so I came up with the following, which suits my own needs:
StringBuilder sb = new StringBuilder();
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where( n =>
n.NodeType == HtmlNodeType.Text &&
n.ParentNode.Name != "script" &&
n.ParentNode.Name != "style");
foreach (HtmlNode node in nodes) {
Console.WriteLine(node.InnerText);
}
var pageContent = "{html content goes here}";
var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent);
var pageText = pageDoc.DocumentNode.InnerText;
The example HTML content:
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
will produce the following output:
foo bar baz
public string html2text(string html) {
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"<html><body>" + html + "</body></html>");
return doc.DocumentNode.SelectSingleNode("//body").InnerText;
}
This workaround is based on Html Agility Pack. You can install it via NuGet (package name: HtmlAgilityPack).
https://github.com/jamietre/CsQuery
Have you tried CsQuery? Although it is not actively maintained, it is still my favorite for converting HTML to text. Here is a one-liner that shows how simple it is to get the text from HTML.
var text = CQ.CreateDocument(htmlText).Text();
Here's a complete console application:
using System;
using CsQuery;
public class Program
{
public static void Main()
{
var html = "<div><h1>Hello World <p> some text inside h1 tag under p tag </p> </h1></div>";
var text = CQ.CreateDocument(html).Text();
Console.WriteLine(text); // Output: Hello World some text inside h1 tag under p tag
}
}
I understand that the OP asked about HtmlAgilityPack only, but CsQuery is a lesser-known library and one of the best solutions I've found, so I wanted to share it in case someone finds it helpful. Cheers!
I combined and fixed some of the other answers so they work better:
var document = new HtmlDocument();
document.LoadHtml(result);
var sb = new StringBuilder();
foreach (var node in document.DocumentNode.DescendantsAndSelf())
{
if (!node.HasChildNodes && node.Name == "#text" && node.ParentNode.Name != "script" && node.ParentNode.Name != "style")
{
string text = node.InnerText?.Trim();
// Skip empty text and stray tag-like fragments; text is already trimmed here
if (!string.IsNullOrEmpty(text) && !text.StartsWith("<") && !text.EndsWith(">"))
sb.AppendLine(System.Web.HttpUtility.HtmlDecode(text));
}
}
Possibly something like the code below. I found a very basic version while googling and extended it to handle hyperlinks, ul, ol, divs, and tables.
using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack;
/// <summary>
/// Static class that provides functions to convert HTML to plain text.
/// </summary>
public static class HtmlToText {
#region Method: ConvertFromFile (public - static)
/// <summary>
/// Converts the HTML content from a given file path to plain text.
/// </summary>
/// <param name="path">The path to the HTML file.</param>
/// <returns>The plain text version of the HTML content.</returns>
public static string ConvertFromFile(string path) {
var doc = new HtmlDocument();
// Load the HTML file
doc.Load(path);
using (var sw = new StringWriter()) {
// Convert the HTML document to plain text
ConvertTo(node: doc.DocumentNode,
outText: sw,
counters: new Dictionary<HtmlNode, int>());
sw.Flush();
return sw.ToString();
}
}
#endregion
#region Method: ConvertFromString (public - static)
/// <summary>
/// Converts the given HTML string to plain text.
/// </summary>
/// <param name="html">The HTML content as a string.</param>
/// <returns>The plain text version of the HTML content.</returns>
public static string ConvertFromString(string html) {
var doc = new HtmlDocument();
// Load the HTML string
doc.LoadHtml(html);
using (var sw = new StringWriter()) {
// Convert the HTML string to plain text
ConvertTo(node: doc.DocumentNode,
outText: sw,
counters: new Dictionary<HtmlNode, int>());
sw.Flush();
return sw.ToString();
}
}
#endregion
#region Method: ConvertContentTo (private - static)
/// <summary>
/// Helper method to convert each child node of the given node to text.
/// </summary>
/// <param name="node">The HTML node to convert.</param>
/// <param name="outText">The writer to output the text to.</param>
/// <param name="counters">Keep track of the ol/li counters during conversion</param>
private static void ConvertContentTo(HtmlNode node, TextWriter outText, Dictionary<HtmlNode, int> counters) {
// Convert each child node to text
foreach (var subnode in node.ChildNodes) {
ConvertTo(subnode, outText, counters);
}
}
#endregion
#region Method: ConvertTo (public - static)
/// <summary>
/// Converts the given HTML node to plain text.
/// </summary>
/// <param name="node">The HTML node to convert.</param>
/// <param name="outText">The writer to output the text to.</param>
/// <param name="counters">Keep track of the ol/li counters during conversion</param>
public static void ConvertTo(HtmlNode node, TextWriter outText, Dictionary<HtmlNode, int> counters) {
string html;
switch (node.NodeType) {
case HtmlNodeType.Comment:
// Don't output comments
break;
case HtmlNodeType.Document:
// Convert entire content of document node to text
ConvertContentTo(node, outText, counters);
break;
case HtmlNodeType.Text:
// Ignore script and style nodes
var parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style")) {
break;
}
// Get text from the text node
html = ((HtmlTextNode)node).Text;
// Ignore special closing nodes output as text
if (HtmlNode.IsOverlappedClosingElement(html) || string.IsNullOrWhiteSpace(html)) {
break;
}
// Write meaningful text (not just white-spaces) to the output
outText.Write(HtmlEntity.DeEntitize(html));
break;
case HtmlNodeType.Element:
switch (node.Name.ToLowerInvariant()) {
case "p":
case "div":
case "br":
case "table":
// Treat paragraphs, divs, line breaks and tables as new lines
outText.Write("\n");
break;
case "li":
// Prefix list items with a running number (inside ol) or a dash (otherwise)
if (node.ParentNode.Name == "ol") {
if (!counters.ContainsKey(node.ParentNode)) {
counters[node.ParentNode] = 0;
}
counters[node.ParentNode]++;
outText.Write("\n" + counters[node.ParentNode] + ". ");
} else {
outText.Write("\n- ");
}
break;
case "a":
// convert hyperlinks to include the URL in parenthesis
if (node.HasChildNodes) {
ConvertContentTo(node, outText, counters);
}
if (node.Attributes["href"] != null) {
outText.Write($" ({node.Attributes["href"].Value})");
}
break;
case "th":
case "td":
outText.Write(" | ");
break;
}
// Convert child nodes to text if they exist (ignore a href children as they are already handled)
if (node.Name.ToLowerInvariant() != "a" && node.HasChildNodes) {
ConvertContentTo(node: node,
outText: outText,
counters: counters);
}
break;
}
}
#endregion
} // class: HtmlToText
string Body = htmlDocument.DocumentNode.SelectSingleNode("//body").InnerText;
Then you need to clean up the text and remove excessive whitespace and so on.
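The cleanup step mentioned above could look roughly like the following. This is only a minimal sketch: the class name TextCleanup and the method CollapseWhitespace are hypothetical helpers (not part of HtmlAgilityPack), and the rules chosen here (collapse spaces and tabs, keep at most one blank line) are an assumption about what "excessive whitespace" means for your content.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class TextCleanup
{
    public static string CollapseWhitespace(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        // Normalize Windows and old Mac line endings to "\n"
        string result = text.Replace("\r\n", "\n").Replace('\r', '\n');

        // Collapse runs of spaces, tabs and non-breaking spaces into one space
        result = Regex.Replace(result, @"[ \t\u00A0]+", " ");

        // Drop the single space that may remain on either side of a line break
        result = Regex.Replace(result, @" ?\n ?", "\n");

        // Allow at most one blank line between blocks of text
        result = Regex.Replace(result, @"\n{3,}", "\n\n");

        return result.Trim();
    }
}
```

For example, TextCleanup.CollapseWhitespace("  foo \r\n\r\n\r\n  bar\tbaz  ") returns "foo\n\nbar baz". You would pass it the InnerText value from the line above; whether to also decode entities first depends on your input.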