DOMDocument and HTML entities

4

7

I'm trying to parse some HTML that includes some HTML entities, like &#215;:

$str = '<a href="http://example.com/"> A &#215; B</a>';

$dom = new DOMDocument;
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
$fullname = $link->nodeValue;
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";    

but DOMDocument substitutes the entity, so the text becomes A × B.

Is there some way to keep it from taking the & for an HTML entity and make it just leave it alone? I tried setting substituteEntities to false, but it doesn't do anything.

Palmerpalmerston answered 28/8, 2011 at 11:46 Comment(2)
Why do you want to keep them? - Rumrunner
I only sort of want to. What I actually want is to replace them with an x, because that would put the text in the same format as some old code from a scraper I'm updating, and I have absolutely no idea how I'd go about including those symbols in a regex. - Palmerpalmerston
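
For the regex part of that comment, here is a minimal sketch, assuming the text read from DOMDocument is UTF-8 (which is what the DOM extension returns). PCRE's u modifier lets a pattern name the multiplication sign by its code point. The function name replaceMultiplicationSign is made up for illustration.

```php
<?php
// Hypothetical helper, not part of the original scraper.
function replaceMultiplicationSign(string $text): string
{
    // With the u modifier, pattern and subject are treated as UTF-8,
    // so \x{00D7} matches the multiplication sign as a single code point.
    // preg_replace() returns null on failure (e.g. invalid UTF-8),
    // in which case the input is returned unchanged.
    return preg_replace('/\x{00D7}/u', 'x', $text) ?? $text;
}

echo replaceMultiplicationSign("A \u{00D7} B"), "\n"; // prints "A x B"
```

For a single fixed character, str_replace() would do the same job without a regex; the pattern form is mainly useful when several symbols need to go into one character class.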
4

This is not a direct answer to the question, but you could use UTF-8 instead, which lets you store glyphs like ÷ or × directly. Using UTF-8 with PHP DOM, on the other hand, needs a little hack.
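
The hack itself is only linked, not shown, so the following is just one commonly used workaround and not necessarily the one meant here: declare the charset inside the markup so the parser does not fall back to its default encoding. This is a minimal sketch, assuming the input string is already UTF-8. The prepended meta tag is an assumption of this sketch.

```php
<?php
// UTF-8 input with a literal multiplication sign instead of an entity.
$html = "<a href=\"http://example.com/\"> A \u{00D7} B</a>";

$dom = new DOMDocument;
// Assumption of this sketch: a charset declaration in front of the markup
// tells the HTML parser to read the bytes as UTF-8.
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);

$link = $dom->getElementsByTagName('a')->item(0);
$fullname = $link->nodeValue; // UTF-8 string containing the multiplication sign

echo "fullname: $fullname\n";
```

When the output is printed into a page, that page needs to be served as UTF-8 as well, otherwise the character is mangled again on display.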

Also, if you are trying to display mathematical formulas (as A × B suggests) have a look at MathML.

Everything answered 28/8, 2011 at 11:57 Comment(3)
Thanks for the hack, it resolved my issues (even if all my UTF-8 entities are still substituted by HTML ones...). It's 2013 now, and we still have to use a trick to get UTF-8 properly handled :-( - Atbara
This answer is low value because all of the insights are held offsite. - Jacobite
Nope, the insights are in the answer; what is missing are the copy'n'paste snippets... - Everything
4

From the docs:

The DOM extension uses UTF-8 encoding.
Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or Iconv for other encodings.

Assuming you're using Latin-1, try:

<?php
// Tell the browser the response is Latin-1, matching utf8_decode() below.
header('Content-type: text/html; charset=iso-8859-1');

// DOM expects UTF-8, so convert the Latin-1 input first.
$str = utf8_encode('<a href="http://example.com/"> A &#215; B</a>');

$dom = new DOMDocument;
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
// nodeValue comes back as UTF-8; convert it to Latin-1 for output.
$fullname = utf8_decode($link->nodeValue);
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";
?>
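
Note that utf8_encode() and utf8_decode() are deprecated as of PHP 8.2. The following is a minimal sketch of the same round trip with mb_convert_encoding(), assuming the mbstring extension is available and the page is still served as ISO-8859-1:

```php
<?php
// Same Latin-1 round trip, using mbstring (assumed to be installed).
$str = mb_convert_encoding(
    '<a href="http://example.com/"> A &#215; B</a>',
    'UTF-8',
    'ISO-8859-1'
);

$dom = new DOMDocument;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
// nodeValue is UTF-8; convert it to Latin-1 to match the page's charset.
$fullname = mb_convert_encoding($link->nodeValue, 'ISO-8859-1', 'UTF-8');

// In the hex dump, d7 is the single Latin-1 byte for the multiplication sign.
echo bin2hex($fullname), "\n";
```

The argument order is (string, target encoding, source encoding), which is easy to get backwards.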
Loraleeloralie answered 28/8, 2011 at 12:17 Comment(2)
Thanks, just using utf8_encode and utf8_decode worked, but I'll read up on the rest of what you used. - Palmerpalmerston
Btw, I used them too, but in reverse order, since my initial data was already encoded. Worked well, thanks! - Neurosurgery
1

Are you sure the & is being replaced with &amp;? If that were the case, you'd see the exact entity, as text, not the garbled response you're getting.

My guess is that it is converted to the actual character, and you're viewing the page with a latin1 charset, which does not contain this character, hence the garbled response.

If I render your example, my output is:

fullname:  A × B 

href: http://example.com/

When viewing this in latin1/iso-8859-1, I see the output you're describing. But when I set the charset to UTF-8, the output is fine.
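
A minimal sketch of how to check this, assuming the markup from the question: dump the bytes of nodeValue and declare the charset explicitly. The bytes c3 97 are the UTF-8 encoding of ×, and a viewer that assumes Latin-1 shows them as two separate characters.

```php
<?php
// Declare the charset so the browser does not have to guess.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$dom = new DOMDocument;
$dom->loadHTML('<a href="http://example.com/"> A &#215; B</a>');

$fullname = $dom->getElementsByTagName('a')->item(0)->nodeValue;

// Hex dump of the text: "c397" in the middle is the multiplication sign in UTF-8.
echo bin2hex($fullname), "\n"; // 204120c3972042
```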

Goutweed answered 28/8, 2011 at 11:55 Comment(3)
That's weird, because I was copying that response from the code. Anyway, I used utf8_encode and utf8_decode and it did the trick. Thanks. - Palmerpalmerston
If you're viewing the response in a browser, it automatically tries to determine the charset. So if you want to see the actual output, you're better off viewing the page source. - Goutweed
Yeah, I meant I was viewing the page source in Chrome, and that's where I got what I pasted. - Palmerpalmerston
0

I fixed my problem with broken entities by converting UTF-8 to UTF-8 with BOM.
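
A minimal sketch of what this could look like when the HTML is handed to DOMDocument as a string, assuming the input is UTF-8 and that the parser honours a leading byte order mark when detecting the encoding. The variable names are illustrative.

```php
<?php
// UTF-8 input with a literal multiplication sign.
$html = "<a href=\"http://example.com/\"> A \u{00D7} B</a>";

// The UTF-8 byte order mark is the three bytes EF BB BF.
$bom = "\xEF\xBB\xBF";

$dom = new DOMDocument;
// Assumption: the leading BOM makes the parser treat the input as UTF-8.
$dom->loadHTML($bom . $html);

$fullname = $dom->getElementsByTagName('a')->item(0)->nodeValue;
echo "fullname: $fullname\n";
```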

Premillenarian answered 12/5, 2021 at 11:56 Comment(0)
