Get the text from all elements with a nominated class as a flat array


I know we can use PHP's DOM extension to parse HTML, but I have a specific requirement. I have HTML content like this:

<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>

I want to parse the above HTML and save the content into two different arrays like:

$heading and $content

$heading = array('Chapter 1', 'Chapter 2', 'Chapter 3');
$content = array('This is chapter 1', 'This is chapter 2', 'This is chapter 3');

I can achieve this easily with jQuery, but I am not sure that's the right way.

Indignity answered 21/8, 2013 at 4:55 Comment(6)

use jquery as its structure is simple. – Quintuplet 21/8, 2013 at 4:58

@Susheel: HTML content will be much bigger as it is the output after parsing docx files – Indignity 21/8, 2013 at 5:0

You could use regular expressions if you don't like to go for PHP DOM. – Dupery 21/8, 2013 at 5:0

@LorenzMeyer do not use regular expressions to parse html – Manicure 21/8, 2013 at 5:6

@blessed for bigger dom use php simple dom parser – Quintuplet 21/8, 2013 at 5:8

@blessed, I have added solution at: https://mcmap.net/q/489085/-get-the-text-from-all-elements-with-a-nominated-class-as-a-flat-array – Escharotic 21/8, 2013 at 6:21


24

Take a look at PHP Simple HTML DOM Parser.

It has a jQuery-like syntax, so you can easily select any element you want by ID or class.

// include/require the simple html dom parser file

$html_string = '
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 1</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 1</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 2</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 2</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 3</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 3</span>
    </p>';
$html = str_get_html($html_string);

// Initialise the result arrays so they exist even when nothing matches
$heading = [];
$content = [];
foreach ($html->find('span') as $element) {
    if ($element->class === 'Heading1-H') {
        $heading[] = $element->innertext;
    } elseif ($element->class === 'Normal-H') {
        $content[] = $element->innertext;
    }
}

Usurious answered 21/8, 2013 at 4:58 Comment(7)

!!NOTICE!! not using "->innertext" leads to memory leaks. – Zebe 14/7, 2019 at 21:19

This is a much easier option and produces more readable code compared to using DomDocument. – Alarise 23/2, 2020 at 14:53

Is there an option to install that with composer? – Watanabe 10/6, 2020 at 17:9

Composer install is now possible: composer require simplehtmldom/simlehtmldom dev-master and use simplehtmldom\HtmlWeb; – Watanabe 10/6, 2020 at 17:13

@Watanabe there is a typo in your comment. missing the "p" in the second "simple" in the composer require command – Tesch 22/12, 2022 at 21:55

@Tesch yeah, that typo is in the linked official source. The corrected version would then be like this: composer require simplehtmldom/simplehtmldom dev-master and use simplehtmldom\HtmlWeb; – Watanabe 4/1, 2023 at 15:1

Users today should be aware that while this might have been an acceptable alternative in 2013, modern code really should stick with a libxml-based parser like the built-in DOMDocument classes. This library uses string manipulation and regular expressions to parse HTML and so is exponentially slower and more memory-hungry on large tasks, as well as being less accurate in certain cases. See this answer or other answers to this question for some better alternatives. – Trinhtrini 11/7 at 18:34


31

I have used DOMDocument and DOMXPath to get the solution:

$test = <<< HTML
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($test);
$xpath = new DOMXPath($dom);
$heading = parseToArray($xpath,'Heading1-H');
$content = parseToArray($xpath,'Normal-H');

var_dump($heading);
echo "<br/>";
var_dump($content);
echo "<br/>";

function parseToArray(DOMXPath $xpath, string $class): array
{
    $xpathquery = "//*[@class='$class']";
    $elements = $xpath->query($xpathquery);

    $resultarray = [];
    foreach ($elements as $element) {
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
          $resultarray[] = $node->nodeValue;
        }
    }

    return $resultarray;
}
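
Note that the inner loop above adds one array entry per child node, so an element that contains nested markup would contribute several entries. A minimal sketch of a variant that yields one entry per matching element by reading textContent instead; the helper name textPerElement and the sample markup are assumptions for illustration, not part of the original answer:

```php
<?php
// Hypothetical helper (not from the answer): one array entry per matching element.
function textPerElement(DOMXPath $xpath, string $class): array
{
    $result = [];
    foreach ($xpath->query("//*[@class='$class']") as $element) {
        // textContent concatenates the text of all descendant nodes.
        $result[] = trim($element->textContent);
    }
    return $result;
}

// Assumed sample input: a heading whose text is split by a nested <b> element.
$dom = new DOMDocument();
$dom->loadHTML('<p><span class="Heading1-H">Chapter <b>1</b></span></p>');
$xpath = new DOMXPath($dom);

$titles = textPerElement($xpath, 'Heading1-H');
print_r($titles);
```

With the child-node loop, the same input would produce two entries ("Chapter " and "1") rather than one.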

Escharotic answered 21/8, 2013 at 5:45 Comment(1)

I've found this link to be very useful to learn the XPATH.query syntax: w3schools.com/xml/xpath_syntax.asp – Vidovik 15/7, 2020 at 20:45


12

Here's an alternative way to parse the HTML using DiDOM.

composer require imangazaliev/didom

<?php

use DiDom\Document;

require_once('vendor/autoload.php');

$html = <<<HTML
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>
HTML;

$document = new Document($html);

// find chapter headings
$elements = $document->find('.Heading1-H');

$headings = [];

foreach ($elements as $element) {
    $headings[] = $element->text();
}

// find chapter texts
$elements = $document->find('.Normal-H');

$chapters = [];

foreach ($elements as $element) {
    $chapters[] = $element->text();
}

echo("Headings\n");

foreach ($headings as $heading) {
    echo("- {$heading}\n");
}

echo("Chapter texts\n");

foreach ($chapters as $chapter) {
    echo("- {$chapter}\n");
}

Photogene answered 25/12, 2020 at 6:11 Comment(4)

@Trinhtrini why the edit? – Photogene 11/7 at 5:43

Because this isn't an advertisement, micro-optimizations are typically a waste of time, and everyone already knows Simple HTML DOM is trash. – Trinhtrini 11/7 at 16:50

Or at least they should by now lol – Trinhtrini 11/7 at 18:35

Respectfully disagree. – Photogene 12/7 at 6:27


6

One option is to use DOMDocument and DOMXPath. They have a bit of a learning curve, but once you are past it you will be pretty happy with what you can achieve.

Read the following on php.net:

http://php.net/manual/en/class.domdocument.php

http://php.net/manual/en/class.domxpath.php

Hope this helps.

Harmonious answered 21/8, 2013 at 5:0 Comment(1)

This has problems with broken HTML – Zebe 14/7, 2019 at 20:56
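
One common way to make DOMDocument more tolerant of malformed markup is to let libxml collect its parse errors internally instead of raising PHP warnings. The following is only a sketch under that assumption; the broken sample string is made up for illustration, and how well the recovered tree matches the author's intent depends on the input:

```php
<?php
// Assumed sample input: a stray </div> and tags left unclosed at the end.
$broken = '<p class="Heading1-P"><span class="Heading1-H">Chapter 1</span></div>'
        . '<p class="Normal-P"><span class="Normal-H">This is chapter 1';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($broken);

// Discard the collected errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$headings = [];
foreach ($xpath->query("//*[@class='Heading1-H']") as $element) {
    $headings[] = trim($element->textContent);
}
print_r($headings);
```

If the errors matter, libxml_get_errors() can be inspected before they are cleared.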


0

Here is the functional-style equivalent of @saji89's answer. Search for any element on any level which has the desired class (use contains() if there may be multiple classes assigned to an element), then target the node text with text(). After converting the DOMNodeList returned by the XPath query to an array, simply isolate the nodeValue column.

Code: (Demo)

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach (['Heading1-H', 'Normal-H'] as $class) {
    var_export(
        array_column(
            iterator_to_array($xpath->query("//*[@class='$class']/text()")),
            'nodeValue'
        )
    );
    echo "\n---\n";
}

Output:

array (
  0 => 'Chapter 1',
  1 => 'Chapter 2',
  2 => 'Chapter 3',
)
---
array (
  0 => 'This is chapter 1',
  1 => 'This is chapter 2',
  2 => 'This is chapter 3',
)
---
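
The contains() variant mentioned above could be sketched as follows. A bare contains(@class, 'Heading1-H') would also match longer class names that merely start with the same text, so this sketch pads the attribute with spaces, a common XPath 1.0 idiom for matching a whole class token. The helper name classTokenQuery and the sample markup are assumptions for illustration:

```php
<?php
// Hypothetical helper: builds an XPath 1.0 query that matches $class as a whole token.
function classTokenQuery(string $class): string
{
    return "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]/text()";
}

// Assumed sample input: one element with two classes, one with a look-alike class.
$html = '<p><span class="Heading1-H first">Chapter 1</span></p>'
      . '<p><span class="Heading1-Hidden">Not a heading</span></p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$headings = [];
foreach ($xpath->query(classTokenQuery('Heading1-H')) as $textNode) {
    $headings[] = $textNode->nodeValue;
}
var_export($headings);
```

The class name is interpolated directly into the query, so this sketch is only suitable for trusted, quote-free class names.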

Mansion answered 10/7 at 23:51 Comment(0)


0

The DOMDocument answers all use XPath, but XPath syntax can be intimidating for new users and for simple processing like this it isn't necessary.

$html_string = <<< HTML
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 1</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 1</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 2</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 2</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 3</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 3</span>
    </p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html_string);

// Initialise the result arrays so they exist even when nothing matches
$heading = [];
$content = [];
foreach ($dom->getElementsByTagName('span') as $element) {
    $class = $element->getAttribute('class');
    if ($class === 'Heading1-H') {
        $heading[] = $element->textContent;
    } elseif ($class === 'Normal-H') {
        $content[] = $element->textContent;
    }
}
print_r($heading);
print_r($content);

Note that when looking for one class in particular, a better check would be something like preg_match('/\bNormal-H\b/', $class), to account for the possibility of multiple items in the class list.
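
A word-boundary regex treats a hyphen as a boundary, so it would also accept a class such as Normal-H-small. As a stricter alternative (a swapped-in technique, not what the answer proposes), the attribute can be split on whitespace and compared token by token. The helper name hasClass is hypothetical:

```php
<?php
// Hypothetical helper: true when $needle is one of the whitespace-separated class tokens.
function hasClass(string $classAttribute, string $needle): bool
{
    $tokens = preg_split('/\s+/', trim($classAttribute), -1, PREG_SPLIT_NO_EMPTY);
    return in_array($needle, $tokens, true);
}

var_dump(hasClass('Normal-H intro', 'Normal-H'));  // exact token among two classes
var_dump(hasClass('Normal-Header', 'Normal-H'));   // only a prefix of a longer class
var_dump(hasClass('Normal-H-small', 'Normal-H'));  // a \b-based regex would match this
```

In the loop above, the comparison $class === 'Normal-H' could then be replaced with hasClass($class, 'Normal-H').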

Trinhtrini answered 11/7 at 21:55 Comment(0)


-13

// Requires the PHP Simple HTML DOM Parser library, which provides file_get_html()

// Create DOM from URL or file

$html = file_get_html('http://www.google.com/');

// Find all images

foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links

foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}

Halfbound answered 5/3, 2014 at 7:55 Comment(2)

file_get_html ?? Is that a PHP function ? – Hippolytus 15/1, 2016 at 11:29

file_get_contents is right. He has copied and pasted from the PHP Simple HTML DOM website – Mikesell 19/9, 2016 at 10:57
