I'm using PHP DOM and I'm trying to get an element within a DOM node that has a given class name. What's the best way to get that sub-element?
Update: I ended up using Mechanize for PHP, which was much easier to work with.
Update: XPath version of the *[class~='my-class'] CSS selector
So after my comment below in response to hakre's comment, I got curious and looked into the code behind Zend_Dom_Query. It looks like the above selector is compiled to the following XPath (untested):
[contains(concat(' ', normalize-space(@class), ' '), ' my-class ')]
So the PHP would be:
$dom = new DomDocument();
$dom->load($filePath);
$finder = new DomXPath($dom);
$classname="my-class";
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
Basically, all we do here is normalize the class attribute so that even a single class is bounded by spaces, and the complete class list is bounded by spaces. Then we pad the class we are searching for with a space on each side. This way we effectively look for, and find, only instances of my-class.
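To see the difference the space padding makes, here is a minimal sketch that runs both the padded expression and a plain contains() test against the same markup. The sample HTML and the variable names $padded and $naive are made up for illustration.

```php
<?php
// Hypothetical sample markup, used only to illustrate the two expressions.
$html = '<div><p class="my-class">exact</p>'
      . '<p class="intro  my-class big">among others</p>'
      . '<p class="my-class2">similar name</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

$classname = 'my-class';

// Whole-token match: the class list and the needle are both padded with spaces.
$padded = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";
// Substring match: also hits class names that merely contain the needle.
$naive  = "//*[contains(@class, '$classname')]";

$paddedCount = $finder->query($padded)->length;
$naiveCount  = $finder->query($naive)->length;

echo "padded: $paddedCount, naive: $naiveCount\n"; // prints "padded: 2, naive: 3"
```

The padded expression skips the my-class2 paragraph, while the plain contains() test counts it.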
Use an xpath selector?
$dom = new DomDocument();
$dom->load($filePath);
$finder = new DomXPath($dom);
$classname="my-class";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
If it is only ever one type of element you can replace the * with the particular tagname.
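Since the question asks about an element inside a given node, it may also help to restrict the search to that node. A rough sketch, assuming made-up markup and a parent div with id "main": DOMXPath::query() takes a context node as its second argument, and a path starting with "." is evaluated relative to it. Taking item(0) is one way to pick only the first match.

```php
<?php
// Hypothetical markup for illustration only.
$html = '<div id="sidebar"><span class="my-class">sidebar span</span></div>'
      . '<div id="main"><p class="my-class">first</p><p class="my-class">second</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// The parent node to search within.
$parent = $finder->query("//div[@id='main']")->item(0);

$classname = 'my-class';
// The leading "." makes the path relative to the context node passed as
// the second argument; "p" replaces "*" to match only paragraph elements.
$nodes = $finder->query(
    ".//p[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]",
    $parent
);

// item(0) returns the first match, or null when nothing matched.
$first = $nodes->item(0);
echo $nodes->length, ' match(es), first: ', $first ? $first->textContent : '(none)', "\n";
```

The span in the sidebar is ignored both because it is outside the context node and because it is not a p element.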
If you need to do a lot of this with very complex selectors I would recommend Zend_Dom_Query, which supports CSS selector syntax (a la jQuery):
$finder = new Zend_Dom_Query($html);
$classname = 'my-class';
$nodes = $finder->query("*[class~=\"$classname\"]");
my-class2 as well, but pretty sweet. Any way to only pick the first of all elements? – Electroplate
class can have more than one class for example: <a class="my-link link-button nav-item">. – Squashy
//*[contains(concat(' ', normalize-space(@class), ' '), ' classname ')] (Very informative: CSS Selectors And XPath Expressions). – Electroplate
classname. GOOD LINK. Wish I had found that instead of reading the code in Zend_Dom_Query... would have been faster, haha. – Squashy
contains in combination with concat... we are just discussing the particulars of padding the spaces on both sides of the class you're searching for or only padding one side. Either should work though. – Squashy
If you wish to get the innerHTML of the class without Zend, you could use this:
$dom = new DomDocument();
$dom->load($filePath);
$classname = 'main-article';
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
$tmp_dom = new DOMDocument();
foreach ($nodes as $node)
{
$tmp_dom->appendChild($tmp_dom->importNode($node,true));
}
$innerHTML = trim($tmp_dom->saveHTML());
echo $innerHTML;
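Note that the snippet above serializes each matched element together with its own tag. If only the markup inside the element is wanted, one possible sketch is to serialize the child nodes one by one with DOMDocument::saveHTML(), which accepts a node argument. The helper name innerHtml and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: returns the markup of a node's children only,
// leaving out the node's own opening and closing tag.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes just that subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return trim($html);
}

$dom = new DOMDocument();
// Sample markup made up for this example.
$dom->loadHTML('<div class="main-article"><p>Hello <b>world</b></p></div>');

$finder = new DOMXPath($dom);
$classname = 'main-article';
$node = $finder->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]"
)->item(0);

$inner = innerHtml($node);
echo $inner, "\n";
```

This avoids the temporary DOMDocument and importNode() step, at the cost of handling one matched node at a time.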
I think the accepted way is better, but I guess this might work as well
function getElementByClass(&$parentNode, $tagName, $className, $offset = 0) {
$response = false;
$childNodeList = $parentNode->getElementsByTagName($tagName);
$tagCount = 0;
for ($i = 0; $i < $childNodeList->length; $i++) {
$temp = $childNodeList->item($i);
if (stripos($temp->getAttribute('class'), $className) !== false) {
if ($tagCount == $offset) {
$response = $temp;
break;
}
$tagCount++;
}
}
return $response;
}
$classResult = getElementByClass($dom, 'div', 'm-signature-pad'); $classResult->nodeValue = ''; $enode = $dom->createElement('img'); $enode->setAttribute('src', $signatureImage); $classResult->appendChild($enode); – Commendation
There is also another approach without the use of DomXPath or Zend_Dom_Query.
Based on dav's original function, I wrote the following function that returns all the children of the parent node whose tag and class match the parameters.
function getElementsByClass(&$parentNode, $tagName, $className) {
$nodes=array();
$childNodeList = $parentNode->getElementsByTagName($tagName);
for ($i = 0; $i < $childNodeList->length; $i++) {
$temp = $childNodeList->item($i);
if (stripos($temp->getAttribute('class'), $className) !== false) {
$nodes[]=$temp;
}
}
return $nodes;
}
Suppose you have a variable $html containing the following HTML:
<html>
<body>
<div id="content_node">
<p class="a">I am in the content node.</p>
<p class="a">I am in the content node.</p>
<p class="a">I am in the content node.</p>
</div>
<div id="footer_node">
<p class="a">I am in the footer node.</p>
</div>
</body>
</html>
Use of getElementsByClass is as simple as:
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($html);
$content_node=$dom->getElementById("content_node");
$p_a_class_nodes=getElementsByClass($content_node, 'p', 'a');//will contain the three <p class="a"> nodes under "content_node".
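One caveat with the stripos() check is that it matches substrings, so searching for 'a' would also match class="main". A possible variant, sketched here under the hypothetical name getElementsByExactClass, splits the class attribute on whitespace and compares whole class names instead. The sample markup is made up for illustration.

```php
<?php
// Hypothetical variant of the function above: compares whole class names
// rather than substrings, so 'a' does not match class="main".
// $parentNode is expected to be a DOMDocument or DOMElement.
function getElementsByExactClass($parentNode, string $tagName, string $className): array
{
    $nodes = [];
    foreach ($parentNode->getElementsByTagName($tagName) as $node) {
        // Split the class attribute into individual class names.
        $classes = preg_split('/\s+/', $node->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
        if (in_array($className, $classes, true)) {
            $nodes[] = $node;
        }
    }
    return $nodes;
}

// Sample markup made up for this example.
$html = '<div id="content_node"><p class="a">one</p><p class="a b">two</p>'
      . '<p class="main">three</p></div>'
      . '<div id="footer_node"><p class="a">footer</p></div>';

$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($html);
$content_node = $dom->getElementById('content_node');

$matches = getElementsByExactClass($content_node, 'p', 'a');
echo count($matches), "\n"; // prints 2: "one" and "two", but not "three"
```

Unlike stripos(), this comparison is case-sensitive, which may or may not be what you want.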
DOMDocument is slow to type and phpQuery has bad memory leak issues. I ended up using:
https://github.com/wasinger/htmlpagedom
To select a class:
include 'includes/simple_html_dom.php';
$doc = str_get_html($html);
$href = $doc->find('.lastPage')[0]->href;
I hope this helps someone else as well
PHP's native DOM handling is so absurdly bad, do yourself a favour and use this or any other modern HTML parsing package which can handle this in a few lines:
Install paquettg/php-html-parser with
composer require paquettg/php-html-parser
Then create a .php file in the same folder with this content
<?php
// load dependencies via Composer
require __DIR__ . '/vendor/autoload.php';
use PHPHtmlParser\Dom;
$dom = new Dom;
$dom->loadFromUrl("https://example.com");
$links = $dom->find('.classname a');
foreach ($links as $link) {
echo $link->getAttribute('href');
}
P.S. You'll find information on how to install Composer on Composer's homepage.
I prefer using Symfony for this. Their libraries are pretty nice.
Use the DomCrawler Component
Example:
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;
$browser = new HttpBrowser(HttpClient::create());
$crawler = $browser->request('GET', 'https://example.com');
$class = $crawler->filter('.class')->first();