HtmlAgilityPack SelectNodes expression to ignore an element with a certain attribute

2

6

I am trying to select all nodes except script nodes and a ul that has the class 'relativeNav'. Can someone please point me in the right direction? I have been searching for a week and can't find an answer anywhere. Currently I have the code below, but it obviously selects the //ul[@class='relativeNav'] as well. Is there any way to add a NOT expression so that SelectNodes will ignore that element?

        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//body//*[not(self::script)]/text()"))
        {
            Console.WriteLine("Node: " + node);
            singleString += node.InnerText.Trim() + "\n";
        }
Paean answered 5/11, 2012 at 3:7 Comment(0)
4

Given an HTML document with a structure similar to:

<html>
<head><title>HtmlDocument</title>
</head>
<body>
<div>
<span>Hello Span World</span>
<script>
Script Text
</script>
</div>
<ul class='relativeNav'>
<li>Hello </li>
<li>Li</li>
<li>World</li>
</ul>
</body>
</html>

The following XPath expression selects the text nodes of every element under body, skipping script elements and the direct children of ul elements with class 'relativeNav':

var nodes = htmlDoc.DocumentNode.SelectNodes("//body//*[not(parent::ul[@class='relativeNav']) and not(self::script)]/text()");
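
To see what the exclusion changes, here is a minimal console sketch that runs the question's original query, the parent-axis expression above, and an ancestor-axis variant against the sample document. It assumes the HtmlAgilityPack NuGet package is referenced; the `XPathDemo` class and the `SelectTrimmedText` helper are names made up for this illustration. Note that `SelectNodes` returns null rather than an empty collection when nothing matches, so the helper guards against that, and it drops the whitespace-only text nodes that sit between elements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class XPathDemo
{
    // The sample document from this answer.
    public const string SampleHtml = @"<html>
<head><title>HtmlDocument</title>
</head>
<body>
<div>
<span>Hello Span World</span>
<script>
Script Text
</script>
</div>
<ul class='relativeNav'>
<li>Hello </li>
<li>Li</li>
<li>World</li>
</ul>
</body>
</html>";

    // Hypothetical helper: runs an XPath query and returns the trimmed,
    // non-empty InnerText of every node it selects.
    public static List<string> SelectTrimmedText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Select(n => n.InnerText.Trim())
            .Where(t => t.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        string[] queries =
        {
            // The question's original query: still picks up the <li> text.
            "//body//*[not(self::script)]/text()",
            // Skips text whose parent element is a direct child of the ul.
            "//body//*[not(parent::ul[@class='relativeNav']) and not(self::script)]/text()",
            // Ancestor-axis variant: skips text whose parent is anywhere below the ul.
            "//body//*[not(ancestor::ul[@class='relativeNav']) and not(self::script)]/text()",
        };

        foreach (string query in queries)
        {
            Console.WriteLine(query);
            Console.WriteLine("  -> " + string.Join(" | ", SelectTrimmedText(SampleHtml, query)));
        }
    }
}
```

On the sample document, the first query should also return the three list items, while the other two should return only the span text.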

Update: I forgot to mention that if you need to exclude all descendants of ul[@class='relativeNav'], irrespective of their depth, you should use:

"//body//*[not(ancestor::ul[@class='relativeNav']) and not(self::script)]/text()"

If you want to exclude the ul element itself as well (somewhat irrelevant in the example above, since the ul contains no text of its own), you should specify:

"//body//*[not(ancestor-or-self::ul[@class='relativeNav']) and not(self::script)]"
Antipodes answered 5/11, 2012 at 7:59 Comment(1)
Your answer was exactly what I was looking for. Thanks for shedding some light on XPath. - Paean
2

I hope this is what you need:

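HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html); // html: the page markup as a string (not shown in the original answer)
var nodesToExclude1 = doc.DocumentNode.SelectNodes("//ul[@class='relativeNav']");
var nodesToExclude2 = doc.DocumentNode.SelectNodes("//body//script");
// "//" on its own is not a valid XPath expression; "//body//*" selects every element under body.
// Note: SelectNodes returns null when nothing matches, so check the results before using them.
var requiredNodes = doc.DocumentNode.SelectNodes("//body//*")
                       .Where(node => !nodesToExclude1.Contains(node) &&
                                      !nodesToExclude2.Contains(node));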

foreach (HtmlNode node in requiredNodes)
{
    Console.WriteLine("Node: " + node);
    singleString += node.InnerText.Trim() + "\n";
}
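
A LINQ-only variant of the same idea can skip the XPath filtering altogether. The sketch below walks the text nodes under body and drops any that has a script element or a ul with class 'relativeNav' among its ancestors. It assumes the HtmlAgilityPack NuGet package is referenced and that the class attribute is exactly `relativeNav` (no additional classes); `LinqFilterDemo`, `IsExcluded`, and `VisibleText` are names made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinqFilterDemo
{
    // True for <script> elements and for <ul class="relativeNav">.
    private static bool IsExcluded(HtmlNode node)
    {
        if (node.Name == "script") return true;
        return node.Name == "ul"
            && node.GetAttributeValue("class", "") == "relativeNav";
    }

    // Hypothetical helper: trimmed, non-empty text under <body>,
    // ignoring everything inside an excluded element.
    public static List<string> VisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the document has no <body>.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null) return new List<string>();

        return body.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Where(n => !n.Ancestors().Any(IsExcluded))
            .Select(n => n.InnerText.Trim())
            .Where(t => t.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        const string html =
            "<html><body><div><span>Hello Span World</span>" +
            "<script>Script Text</script></div>" +
            "<ul class='relativeNav'><li>Hello</li><li>Li</li></ul>" +
            "</body></html>";

        foreach (string line in VisibleText(html)) Console.WriteLine(line);
    }
}
```

Because the check runs over every ancestor, this version excludes text at any depth inside the ul or a script element, which matches the ancestor-based XPath rather than the parent-based one.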
Ligure answered 5/11, 2012 at 3:30 Comment(2)
It gave an "XPathException: Expression must evaluate to a node-set" when I used this: "var requiredNodes = doc.DocumentNode.SelectNodes("//").Where(node => !nodesToExclude.Contains(node));". Plus, I have two other requirements: to select only nodes under "//body", and not to select script elements ("//*[not(self::script)]/text()"). It gave me a null object exception when I put them in the SelectNodes call for requiredNodes: "var requiredNodes = doc.DocumentNode.SelectNodes("//body//*[not(self::script)]/text()").Where(node => !nodesToExclude.Contains(node));" - Paean
Thanks. The LINQ expression will come in handy for me in the future. - Paean
