I'm using Mojo::DOM to identify and print out phrases (meaning strings of text between selected HTML tags) in hundreds of HTML documents that I'm extracting from existing content in the Movable Type content management system.
I'm writing those phrases out to a file so they can be translated into other languages. The extraction code is as follows:
$dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));
##########
#
# Break down the Body into phrases. This is done by listing the tags and tag combinations that
# surround each block of text that we're looking to capture.
#
##########
print FILE "\n\t### Body\n\n";
for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {
    print_phrase($phrase); # utility function to write out the phrase to a file
}
When Mojo::DOM encountered embedded HTML entities (such as &trade; and &nbsp;), it converted those entities into the characters they represent rather than passing them along as written. I wanted the entities to be passed through as written.
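One possible workaround, sketched here as an assumption rather than anything Mojo::DOM provides: escape the ampersand of each entity reference before parsing, so that the parser's own decoding should turn &amp;trade; back into the literal text &trade;. The helper name protect_entities is hypothetical, and the sketch uses only core Perl. It also rewrites entities inside attribute values, which may or may not be wanted.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Mojolicious): escape the leading "&" of
# every named, decimal, or hexadecimal entity reference. A bare "&" that is
# not part of an entity reference (as in "AT&T") is left alone.
sub protect_entities {
    my ($html) = @_;
    $html =~ s/&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);/&amp;$1;/g;
    return $html;
}

# Assumed usage with Mojo::DOM (requires Mojolicious, so it is left commented out):
#   my $dom = Mojo::DOM->new(protect_entities($decoded_html));
#   print_phrase($_) for $dom->find('p')->map('text')->each;
```

Because the substitution runs on the whole document before parsing, the selector list and the print_phrase loop would not need to change.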
I recognized that I could use Mojo::Util::decode to pass these HTML entities through to the file I'm writing. The problem is "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."
If this is the case, I have to either figure out how to test the encoding of the current HTML page before calling Mojo::Util::decode('UTF-8', $page->text), or use some other technique to preserve the encoded HTML entities.
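Rather than testing the encoding up front, one option is to attempt the decode and fall back to the original string when it fails. The following is a minimal sketch using the core Encode module. The helper name decode_if_utf8 is hypothetical, and it assumes that input which fails strict UTF-8 decoding is already a Perl character string.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return decoded characters when $input holds valid
# UTF-8 bytes; otherwise return $input unchanged.
sub decode_if_utf8 {
    my ($input) = @_;

    # Work on a copy, because Encode may modify its argument when a CHECK
    # value such as FB_CROAK is supplied.
    my $copy = $input;

    # FB_CROAK makes decode die on malformed input; eval turns that into undef.
    my $chars = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };

    return defined $chars ? $chars : $input;
}
```

Since Mojo::Util::decode is described above as returning undef on failure, the same fallback could probably be written inline as Mojo::Util::decode('UTF-8', $page->text) // $page->text (Perl 5.10 or later). Note that this only addresses the decoding failure; it does not by itself stop entities from being converted.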
How do I most reliably preserve encoded HTML Entities when processing HTML documents with Mojo::DOM?
What about putting utf8::decode($phrase) if !utf8::is_utf8($phrase); inside of the for my $phrase () loop above? I would think that this would do nearly the same thing as I was trying to do with Mojo::Util::decode('UTF-8', $page->text) inside of the Mojo::DOM->new() statement, except that the test would only be run on the phrases that were identified in $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a'). – Cabrales