How do I most reliably preserve HTML Entities when processing HTML documents with Mojo::DOM?

I'm using Mojo::DOM to identify and print out phrases (meaning strings of text between selected HTML tags) in hundreds of HTML documents that I'm extracting from existing content in the Movable Type content management system.

I'm writing those phrases out to a file as follows, so that they can be translated into other languages:

        $dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));

    ##########
    #
    # Break down the Body into phrases. This is done by listing the tags and tag combinations that
    # surround each block of text that we're looking to capture.
    #
    ##########

        print FILE "\n\t### Body\n\n";        

        for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {

            print_phrase($phrase); # utility function to write out the phrase to a file

        }

When Mojo::DOM encountered embedded HTML entities (such as &trade; and &nbsp;) it converted those entities into encoded characters, rather than passing them along as written. I wanted the entities to be passed through as written.

I recognized that I could use Mojo::Util::decode to pass these HTML entities through to the file I'm writing. The problem is "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."

If this is the case, I have to either try to figure out how to test the encoding of the current HTML page before calling Mojo::Util::decode('UTF-8', $page->text), or I must use some other technique to preserve the encoded HTML entities.
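To make the failure mode concrete, here is a minimal sketch of a tolerant decode that uses only the core Encode module. The name decode_if_utf8 is hypothetical, and the fallback is a heuristic rather than a real encoding test: a string that was already decoded can still happen to look like valid UTF-8 bytes, so knowing the encoding of the source remains the reliable fix.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: try a strict UTF-8 decode and fall back to the
# input unchanged when the decode fails (for example because the text
# is already a string of Perl characters).
sub decode_if_utf8 {
    my ($text) = @_;
    return '' unless defined $text;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $text in place.
    my $decoded = eval { decode('UTF-8', $text, FB_CROAK | LEAVE_SRC) };
    return defined $decoded ? $decoded : $text;
}

binmode(STDOUT, ':encoding(UTF-8)');

my $bytes = "caf\xC3\xA9";           # UTF-8 encoded bytes
my $chars = "caf\x{E9} \x{2122}";    # already decoded characters

print decode_if_utf8($bytes), "\n";
print decode_if_utf8($chars), "\n";
```

The result could then be handed to Mojo::DOM->new() in place of the bare Mojo::Util::decode call.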

How do I most reliably preserve encoded HTML Entities when processing HTML documents with Mojo::DOM?

Cabrales answered 12/3, 2019 at 21:25 Comment(5)
@Grinnz @Robert is it OK to use utf8::decode($phrase) if !utf8::is_utf8($phrase); inside of the for my $phrase () loop above? I would think that this would do nearly the same thing as I was trying to do with Mojo::Util::decode('UTF-8', $page->text) inside of the Mojo::DOM->new() statement, except that the test would only be run on the phrases that were identified in $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a').Cabrales
No. is_utf8 does not tell you anything useful about whether your data is encoded, it is an internal Perl flag. Your data will always be encoded or decoded, you need to know which.Alta
The issue is that I'm getting HTML from the Movable Type API. I have no control over what that HTML content is. What we've seen up to now is that some of the HTML is in the form of UTF-8 encoded strings, and some is apparently considered Perl strings. Sorry if I am repeating myself, to some extent.Cabrales
You may not have control over the input data, but the API you are using should, otherwise it is a bug. I think it is simpler than you are describing, but you'll have to be more specific about what you're doing (and that would be probably off topic for this question).Alta
@Alta I will look at the Movable Type API and see if I see a way to ensure that everything comes out of the wire UTF-8 encoded. Thanks!Cabrales

Through testing, my colleagues and I were able to determine that Mojo::DOM->new() was decoding ampersand characters (&) automatically, rendering the preservation of HTML Entities as written impossible. To get around this, we added the following subroutine to double encode ampersand:

sub encode_amp {
    my ($text) = @_;

    ##########
    #
    # We discovered that we need to encode ampersand
    # characters being passed into Mojo::DOM->new() to avoid HTML entities being decoded
    # automatically by Mojo::Util::html_unescape().
    #
    # What we're doing is calling $dom = Mojo::DOM->new(encode_amp($string)) which double encodes
    # any incoming ampersand or &amp; characters.
    #
    #
    ##########   

    $text .= '';           # Suppress uninitialized value warnings
    $text =~ s!&!&amp;!g;  # HTML encode ampersand characters
    return $text;
}

Later in the script we pass $page->text through encode_amp() as we instantiate a new Mojo::DOM object.

    $dom = Mojo::DOM->new(encode_amp($page->text));

##########
#
# Break down the Body into phrases. This is done by listing the tags and tag combinations that
# surround each block of text that we're looking to capture.
#
# Note that "h2 b" is an important tag combination for capturing major headings on pages
# in this theme. The tags "span" and "a" are also.
#
# We added caption and th to support tables.
#
# We added li and li a to support ol (ordered lists) and ul (unordered lists).
#
# We got the complicated map('descendant_nodes') logic from @Grinnz on StackOverflow, see:
# https://mcmap.net/q/1769401/-how-do-i-most-reliably-preserve-html-entities-when-processing-html-documents-with-mojo-dom#comment97006305_55131737
#
#
# Original set of selectors in $dom->find() below is as follows:
#   'h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a'
#
##########

    print FILE "\n\t### Body\n\n";        

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->
        map('descendant_nodes')->map('each')->grep(sub { $_->type eq 'text' })->map('content')->uniq->each ) {           

        print_phrase($phrase);

    }

The code block above incorporates previous suggestions from @Grinnz as seen in the comments in this question. Thanks also to @Robert for his answer, which had a good observation about how Mojo::DOM works.

This code definitely works for my application.
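For anyone who wants to try the approach in isolation, here is a small self-contained sketch of the same round trip. The sample markup, the reduced selector list, and the extract_phrases wrapper are made up for illustration; the idea is only that each ampersand is escaped once before parsing, so the single unescape pass done by the parser restores the entity text as originally written.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Escape every ampersand once so that the parser's unescape step
# yields the original entity text instead of the decoded character.
sub encode_amp {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;
    return $text;
}

# Hypothetical wrapper around the chain used above: collect the text
# nodes below every matched element and drop repeated phrases.
sub extract_phrases {
    my ($html, $selectors) = @_;
    my $dom = Mojo::DOM->new(encode_amp($html));
    return $dom->find($selectors)
        ->map('descendant_nodes')->map('each')
        ->grep(sub { $_->type eq 'text' })
        ->map('content')->uniq->to_array;
}

# Made-up sample input with named entities and a nested tag.
my $html    = '<p>Acme&trade; &amp; Sons<strong>&nbsp;rocks</strong></p>';
my $phrases = extract_phrases($html, 'p, p strong');

print "$_\n" for @$phrases;
```

One side effect to keep in mind: ampersands inside attribute values are double encoded as well, which does not matter when only text nodes are read but would matter if attributes were extracted from the same DOM.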

Cabrales answered 10/4, 2019 at 2:46 Comment(2)
Please stop using bareword file handles. FILE is not ok. open(my $fh, '<', '/path/to/file'). Never open FILE, '</path/to/file';Angers
I feel like this is missing the actual problem here. Maybe you should change the other parts of your code to look for the characters you want to process rather than the html entities?Valadez
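The filehandle advice in the first comment could look roughly like the sketch below. The temp file stands in for the real output path, and this print_phrase is a hypothetical replacement for the question's utility function that takes the handle as an argument; the explicit UTF-8 layer is an assumption about how the output file should be encoded.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical output location; a temp file keeps the sketch self-contained.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Three-argument open with a lexical handle and an explicit encoding layer.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot open $path: $!";

# Stand-in for print_phrase(): the handle is passed in instead of
# relying on a global bareword such as FILE.
sub print_phrase {
    my ($fh, $phrase) = @_;
    print {$fh} "\t$phrase\n";
}

print {$out} "\n\t### Body\n\n";
print_phrase($out, $_) for ('Acme&trade;', "caf\x{E9}");

close($out) or die "Cannot close $path: $!";
```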

Looks like when you map to text you get XML entities replaced, but when you instead work with the nodes and use their content, the entities are preserved. This minimal example:

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>this &amp; &quot;that&quot;</p>');
for my $phrase ($dom->find('p')->each) {
    print $phrase->content(), "\n";
}

prints:

this &amp; &quot;that&quot;

If you want to keep your loop and map, replace map('text') with map('content') like this:

for my $phrase ($dom->find('p')->map('content')->each) {
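A short sketch of the difference between the two accessors, using made-up markup. As far as I can tell, content re-serializes the element, so nested tags are included and only the markup-significant characters (such as & and ") come back escaped; an entity like &trade; appears to come back as the literal character either way.

```perl
use strict;
use warnings;
use Mojo::DOM;

binmode(STDOUT, ':encoding(UTF-8)');

my $dom = Mojo::DOM->new('<p>this &amp; <b>&quot;that&quot;</b> &trade;</p>');
my $p   = $dom->at('p');

# all_text: decoded text of the element and its descendants, no tags.
my $shown = $p->all_text;

# content: re-serialized inner markup, nested tags included.
my $markup = $p->content;

print "all_text: $shown\n";
print "content:  $markup\n";
```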

If you have nested tags and want to find only the texts (but not print those nested tag names, only their contents), you'll need to scan the DOM tree:

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p><i>this &amp; <b>&quot;</b><b>that</b><b>&quot;</b></i></p><p>done</p>');

for my $node (@{$dom->find('p')->to_array}) {
    print_content($node);
}

sub print_content {
    my ($node) = @_;
    if ($node->type eq "text") {
        print $node->content(), "\n";
    }
    if ($node->type eq "tag") {    
        for my $child ($node->child_nodes->each) {
            print_content($child);
        }
    }
}

which prints:

this & 
"
that
"
done
Rotorua answered 12/3, 2019 at 22:44 Comment(11)
To be more specific, the text method returns the text as it would be displayed by a browser, the content method returns the raw HTML encoded node content (including any nested tags, etc).Alta
@Grinnz, yes, I just tried this. One issue is that I need to strip those nested tags, so that I'm down to the innerText, at least in effect. The other issue is that it looks like the line $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('content')->each results in entities being converted from &quote; to &#39;.Cabrales
@DaveAiello I've added code to handle nested tags. Basically the same, you just need to add a recursive sub to walk the tree.Rotorua
You can use descendant_nodes instead of recursion, for example: $dom->find('p')->map('descendant_nodes')->grep(sub { $_->type eq 'text' })->map('content')->eachAlta
@Alta Maybe... but not like you posted --- Can't locate object method "type" via package "Mojo::Collection"Rotorua
Right, to put the descendant nodes from multiple tags into one collection you could put in ->map('each') before the grepAlta
I'm inclined to answer my own question-- because the code that I end up using is going to be an amalgamation of what was discussed in several comments, and the code looks more like what I originally stated than what @Rotorua proposed as a solution.Cabrales
@Alta use of the map('descendant_nodes')->map('each')... logic you proposed is working pretty well. The biggest problem is that descendant_nodes used in conjunction with multiple selectors ('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a') gives me duplicate phrases. In some cases the repetition occurs 3 or more times. Ideally, we would only get one copy of each unique phrase per document. There probably needs to be a reduce in there somewhere. Do you agree? If so, where? Working by trial and error at the moment.Cabrales
@Alta looks like I misspoke. I just looked at Mojo::Collection, and it appears we need a uniq in the right place. Maybe something like for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('descendant_nodes')->map('each')->grep(sub { $_->type eq 'text' })->map('content')->uniq->each ) {...}?Cabrales
@DaveAiello It is because you are matching some tags that are contained in other tags you matched.Alta
@Alta no doubt. I needed to go with uniq because I couldn't guarantee that selectors like span were always inside of p. So I felt like I needed to keep all of the selectors in $dom->find() that were already there.Cabrales
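The duplication discussed in these comments can be reproduced with a two-selector sketch. The markup and the reduced selector list are made up for illustration: the text inside the strong element is reached once as a descendant of p and once through the 'p strong' match, and uniq removes the repeat.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>Read <strong>this</strong> first</p>');

# map('descendant_nodes') yields one collection per matched element;
# map('each') flattens them into a single collection of nodes.
my $texts = $dom->find('p, p strong')
    ->map('descendant_nodes')->map('each')
    ->grep(sub { $_->type eq 'text' })
    ->map('content');

my @all    = @{ $texts->to_array };
my @unique = @{ $texts->uniq->to_array };

print scalar(@all),    " phrases before uniq\n";
print scalar(@unique), " phrases after uniq\n";
```

Note that uniq compares the phrase strings, so a phrase that legitimately occurs twice in a document is also reduced to one copy.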