Get the text from all elements with a nominated class as a flat array
I

7

28

I know we can use PHP DOM to parse HTML, but I have a specific requirement. I have HTML content like the following:

<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>

I want to parse the above HTML and save the content into two different arrays like:

$heading and $content

$heading = array('Chapter 1', 'Chapter 2', 'Chapter 3');
$content = array('This is chapter 1', 'This is chapter 2', 'This is chapter 3');

I can achieve this easily using jQuery, but I am not sure if that's the right way.

Indignity answered 21/8, 2013 at 4:55 Comment(6)
Use jQuery, as its structure is simple. - Quintuplet
@Susheel: The HTML content will be much bigger, as it is the output after parsing docx files. - Indignity
You could use regular expressions if you don't want to go with PHP DOM. - Dupery
@LorenzMeyer Do not use regular expressions to parse HTML. - Manicure
@blessed For a bigger DOM, use PHP Simple HTML DOM Parser. - Quintuplet
@blessed, I have added a solution at: https://mcmap.net/q/489085/-get-the-text-from-all-elements-with-a-nominated-class-as-a-flat-array - Escharotic
U
24

Take a look at PHP Simple HTML DOM Parser.

It has a jQuery-like syntax, so you can easily select any element you want by ID or class.

// include/require the simple html dom parser file

$html_string = '
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 1</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 1</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 2</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 2</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 3</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 3</span>
    </p>';
$html = str_get_html($html_string);

// Initialize both arrays so they exist even when nothing matches.
$heading = [];
$content = [];
foreach ($html->find('span') as $element) {
    if ($element->class === 'Heading1-H') {
        $heading[] = $element->innertext;
    } elseif ($element->class === 'Normal-H') {
        $content[] = $element->innertext;
    }
}
Usurious answered 21/8, 2013 at 4:58 Comment(7)
!!NOTICE!! not using "->innertext" leads to memory leaks. - Zebe
This is a much easier option and produces more readable code compared to using DOMDocument. - Alarise
Is there an option to install that with Composer? - Watanabe
Composer install is now possible: composer require simplehtmldom/simlehtmldom dev-master and use simplehtmldom\HtmlWeb; - Watanabe
@Watanabe There is a typo in your comment: the "p" is missing from the second "simple" in the composer require command. - Tesch
@Tesch Yeah, that typo is in the linked official source. The corrected version would then be: composer require simplehtmldom/simplehtmldom dev-master and use simplehtmldom\HtmlWeb; - Watanabe
Users today should be aware that while this might have been an acceptable alternative in 2013, modern code really should stick with a libxml-based parser like the built-in DOMDocument classes. This library uses string manipulation and regular expressions to parse HTML, so it is much slower and more memory-hungry on large tasks, as well as being less accurate in certain cases. See this answer or other answers to this question for some better alternatives. - Trinhtrini
E
31

I have used DOMDocument and DOMXPath to get the solution:

$test = <<< HTML
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($test);
$xpath = new DOMXPath($dom);
$heading = parseToArray($xpath,'Heading1-H');
$content = parseToArray($xpath,'Normal-H');

var_dump($heading);
echo "<br/>";
var_dump($content);
echo "<br/>";

function parseToArray(DOMXPath $xpath, string $class): array
{
    $xpathquery = "//*[@class='$class']";
    $elements = $xpath->query($xpathquery);

    $resultarray = [];
    foreach ($elements as $element) {
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
          $resultarray[] = $node->nodeValue;
        }
    }

    return $resultarray;
}
Escharotic answered 21/8, 2013 at 5:45 Comment(1)
I've found this link very useful for learning the XPath query syntax: w3schools.com/xml/xpath_syntax.asp - Vidovik
P
12

Here's an alternative way to parse the HTML, using DiDOM. Install it with Composer first:

composer require imangazaliev/didom
<?php

use DiDom\Document;

require_once('vendor/autoload.php');

$html = <<<HTML
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>
HTML;

$document = new Document($html);

// find chapter headings
$elements = $document->find('.Heading1-H');

$headings = [];

foreach ($elements as $element) {
    $headings[] = $element->text();
}

// find chapter texts
$elements = $document->find('.Normal-H');

$chapters = [];

foreach ($elements as $element) {
    $chapters[] = $element->text();
}

echo("Headings\n");

foreach ($headings as $heading) {
    echo("- {$heading}\n");
}

echo("Chapter texts\n");

foreach ($chapters as $chapter) {
    echo("- {$chapter}\n");
}
Photogene answered 25/12, 2020 at 6:11 Comment(4)
@Trinhtrini Why the edit? - Photogene
Because this isn't an advertisement, micro-optimizations are typically a waste of time, and everyone already knows Simple HTML DOM is trash. - Trinhtrini
Or at least they should by now lol - Trinhtrini
Respectfully disagree. - Photogene
H
6

One option is to use DOMDocument and DOMXPath. They have a bit of a learning curve, but once you are past it, you will be pretty happy with what you can achieve.

Read the following on php.net:

http://php.net/manual/en/class.domdocument.php

http://php.net/manual/en/class.domxpath.php

Hope this helps.

Harmonious answered 21/8, 2013 at 5:0 Comment(1)
This has problems with broken HTML. - Zebe
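On the broken-HTML concern: DOMDocument generally still builds a tree from malformed markup, but it reports the problems as PHP warnings. A minimal sketch of one common way to handle that, assuming the warnings are the main issue, is to switch libxml to internal error collection while loading. The sample markup (with unclosed span tags) and the variable names are invented for this illustration.

```php
<?php
// Sample markup invented for this sketch: both <span> elements are never closed.
$broken = '<p class="Heading1-P"><span class="Heading1-H">Chapter 1</p>'
    . '<p class="Normal-P"><span class="Normal-H">This is chapter 1';

// Collect libxml parse errors internally instead of emitting PHP warnings,
// remembering the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($broken);

$errors = libxml_get_errors(); // array of LibXMLError objects (may be empty)
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Query the recovered tree as usual.
$xpath = new DOMXPath($dom);
$headings = [];
foreach ($xpath->query("//span[@class='Heading1-H']") as $span) {
    $headings[] = trim($span->textContent);
}
$content = [];
foreach ($xpath->query("//span[@class='Normal-H']") as $span) {
    $content[] = trim($span->textContent);
}

var_export($headings);
echo "\n";
var_export($content);
printf("\n%d parse issue(s) reported by libxml\n", count($errors));
```

How the parser recovers depends on the installed libxml2 version, so badly damaged input may still end up in a tree that differs from what a browser would build.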
M
0

Here is the functional-style equivalent of @saji89's answer. Search for any element at any level that has the desired class (use contains() if an element may have multiple classes assigned), then target its text nodes with text(). After converting the DOMNodeList returned by the query to an array with iterator_to_array(), simply isolate the nodeValue column with array_column().

Code: (Demo)

// $html is assumed to hold the markup from the question.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach (['Heading1-H', 'Normal-H'] as $class) {
    var_export(
        array_column(
            iterator_to_array($xpath->query("//*[@class='$class']/text()")),
            'nodeValue'
        )
    );
    echo "\n---\n";
}

Output:

array (
  0 => 'Chapter 1',
  1 => 'Chapter 2',
  2 => 'Chapter 3',
)
---
array (
  0 => 'This is chapter 1',
  1 => 'This is chapter 2',
  2 => 'This is chapter 3',
)
---
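
The contains() variant mentioned above can be sketched as follows. A bare contains(@class, 'Heading1-H') would also match a class such as Heading1-H2, so this version pads the normalized class list with spaces and looks for the class as a whole token. The helper name textsByClassToken and the sample markup are hypothetical, and the sketch assumes the class name contains no single quote.

```php
<?php
// Hypothetical helper (not part of the answer above): collect the trimmed text
// of every element whose class list contains $class as a whole token.
// Assumes $class contains no single quote, since it is placed in the query as-is.
function textsByClassToken(DOMXPath $xpath, string $class): array
{
    // Pad the normalized class list with spaces so only whole tokens match.
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $element) {
        $texts[] = trim($element->textContent);
    }

    return $texts;
}

// Sample markup invented for this sketch.
$sample = '<p><span class="Heading1-H first">Chapter 1</span></p>'
    . '<p><span class="Normal-H">This is chapter 1</span></p>'
    . '<p><span class="Heading1-H2">Not a heading</span></p>';

$dom = new DOMDocument();
$dom->loadHTML($sample);
$xpath = new DOMXPath($dom);

$headings = textsByClassToken($xpath, 'Heading1-H');
$content = textsByClassToken($xpath, 'Normal-H');

var_export($headings);
echo "\n";
var_export($content);
```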
Mansion answered 10/7 at 23:51 Comment(0)
T
0

The DOMDocument answers all use XPath, but XPath syntax can be intimidating for new users, and for simple processing like this it isn't necessary.

$html_string = <<< HTML
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 1</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 1</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 2</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 2</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 3</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 3</span>
    </p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html_string);

// Initialize both arrays so print_r() works even when nothing matches.
$heading = [];
$content = [];
foreach ($dom->getElementsByTagName('span') as $element) {
    $class = $element->getAttribute('class');
    if ($class === 'Heading1-H') {
        $heading[] = $element->textContent;
    } elseif ($class === 'Normal-H') {
        $content[] = $element->textContent;
    }
}
print_r($heading);
print_r($content);

Note that when looking for a particular class, a better check would be something like preg_match('/\bNormal-H\b/', $class), to account for the possibility of multiple items in the class list.
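
As a possible alternative to the regular expression check, here is a minimal sketch that splits the class attribute on whitespace and compares whole tokens. One reason to consider it: the hyphen counts as a word boundary, so a \b-based pattern for Normal-H would also match a class like Normal-H-note. The helper name hasClass and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: true when $className is one whole token of the
// element's class attribute.
function hasClass(DOMElement $element, string $className): bool
{
    $tokens = preg_split(
        '/\s+/',
        $element->getAttribute('class'),
        -1,
        PREG_SPLIT_NO_EMPTY
    );

    return in_array($className, $tokens, true);
}

// Sample markup invented for this sketch.
$sample = '<p><span class="Normal-H intro">This is chapter 1</span></p>'
    . '<p><span class="Normal-H-note">A side note</span></p>';

$dom = new DOMDocument();
$dom->loadHTML($sample);

$content = [];
foreach ($dom->getElementsByTagName('span') as $span) {
    if (hasClass($span, 'Normal-H')) {
        $content[] = $span->textContent;
    }
}
print_r($content);
```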

Trinhtrini answered 11/7 at 21:55 Comment(0)
H
-13

// Create DOM from URL or file

$html = file_get_html('http://www.google.com/');

// Find all images

foreach($html->find('img') as $element) 
   echo $element->src . '<br>';

// Find all links

foreach($html->find('a') as $element) 
   echo $element->href . '<br>';
Halfbound answered 5/3, 2014 at 7:55 Comment(2)
file_get_html?? Is that a PHP function? - Hippolytus
file_get_contents is the right one. He has copy-pasted this from the PHP Simple HTML DOM website. - Mikesell
