DOMDocument and HTML entities

Asked 28/8, 2011 at 11:46 Answered 12/5, 2021 at 11:56

I'm trying to parse some HTML that includes some HTML entities, like &#215;:

$str = '<a href="http://example.com/"> A &#215; B</a>';

$dom = new DOMDocument;
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
$fullname = $link->nodeValue;
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";

but DOMDocument turns the text into A Ã— B.

Is there some way to keep it from treating the & as the start of an HTML entity and make it just leave it alone? I tried setting substituteEntities to false, but it doesn't do anything.

Palmerpalmerston answered 28/8, 2011 at 11:46 Comment(2)

why do you want to keep them? – Rumrunner 28/8, 2011 at 11:50

I only sort of want to. What I actually want is to replace them with an x, because that would put the text in the same format as some old code from a scraper I'm updating, and I have absolutely no idea how I'd go about including those symbols in a regex – Palmerpalmerston 28/8, 2011 at 14:18

This is not a direct answer to the question, but you may use UTF-8 instead, which allows you to store glyphs like ÷ or × directly. Using UTF-8 with PHP DOM, on the other hand, needs a little hack.
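The linked hack is not reproduced in this thread. As a rough sketch of one commonly used workaround (an assumption, not necessarily the one the answer refers to), you can prepend a meta charset tag so that loadHTML() reads the input as UTF-8 instead of its ISO-8859-1 default. The sample string is the one from the question with a literal × in place of the entity.

```php
<?php
// Assumption: $html is a UTF-8 string containing the literal × character.
$html = '<a href="http://example.com/"> A × B</a>';

$dom = new DOMDocument;
// Without an encoding hint, loadHTML() treats the bytes as ISO-8859-1.
// The meta tag declares the charset before any non-ASCII byte is read.
$dom->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
);

$link = $dom->getElementsByTagName('a')->item(0);
// nodeValue is returned as UTF-8.
echo $link->nodeValue, "\n";
```

The extra meta element stays in the parsed document, so remove it again if you serialize the result with saveHTML().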

Also, if you are trying to display mathematical formulas (as A × B suggests), have a look at MathML.

Everything answered 28/8, 2011 at 11:57 Comment(3)

Thanks for the hack, it resolved my issues (even if all my UTF-8 entities are still substituted by HTML ones...). It's 2013 now, and we still have to use a trick to get UTF-8 properly handled :-( – Atbara 5/5, 2013 at 8:30

This answer is low value because all of the insights are held offsite. – Jacobite 19/2, 2021 at 23:55

Nope, the insights are in the answer - what is missing are the copy'n'paste snippets.... – Everything 22/2, 2021 at 8:56

From the docs:

The DOM extension uses UTF-8 encoding.
Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or Iconv for other encodings.

Assuming you're using latin-1, try:

<?php
// Tell the browser that the output is ISO-8859-1.
header('Content-Type: text/html; charset=iso-8859-1');

// Convert the ISO-8859-1 input to UTF-8 before handing it to DOM.
$str = utf8_encode('<a href="http://example.com/"> A &#215; B</a>');

$dom = new DOMDocument;
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
// DOM returns UTF-8, so convert the text back to ISO-8859-1.
$fullname = utf8_decode($link->nodeValue);
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";
?>
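utf8_encode() and utf8_decode() are deprecated as of PHP 8.2. A minimal sketch of the same round trip with mb_convert_encoding(), assuming the mbstring extension is available and the data really is ISO-8859-1:

```php
<?php
// Assumption: the input is ISO-8859-1, as in the answer above.
// This sample is pure ASCII, so the first conversion changes nothing here.
$latin1 = '<a href="http://example.com/"> A &#215; B</a>';

// ISO-8859-1 -> UTF-8 (the job utf8_encode() did).
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

$dom = new DOMDocument;
$dom->loadHTML($utf8);

$link = $dom->getElementsByTagName('a')->item(0);

// nodeValue is UTF-8; convert it to ISO-8859-1 (the job utf8_decode() did).
$fullname = mb_convert_encoding($link->nodeValue, 'ISO-8859-1', 'UTF-8');

// Print the bytes in hex so the result does not depend on the terminal charset.
echo bin2hex($fullname), "\n"; // 204120d72042
```

The single byte d7 is × in ISO-8859-1, which matches the charset announced by the header() call in the answer.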

Loraleeloralie answered 28/8, 2011 at 12:17 Comment(2)

Thanks, just using utf8_encode and utf8_decode worked, but I'll read about all the rest you used – Palmerpalmerston 28/8, 2011 at 12:24

Btw, I used them, but in reverse order, since my initial data was already encoded. Worked well, thanks! – Neurosurgery 2/8, 2012 at 18:2

Are you sure the & is being substituted with &amp;? If that were the case, you'd see the exact entity as text, not the garbled response you're getting.

My guess is that it is converted to the actual character and output as UTF-8, while you're viewing the page with a latin1 charset. The two UTF-8 bytes of × are then shown as two separate characters, hence the garbled response.

If I render your example, my output is:

fullname:  A × B 

href: http://example.com/

When viewing this in latin1/iso-8859-1, I see the output you're describing. But when I set the charset to UTF-8, the output is fine.
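To check this without a browser, you can look at the raw bytes. The sketch below assumes the mbstring extension and uses Windows-1252, which is what browsers typically apply when a page is labelled latin1; it prints the bytes of the node value and then deliberately misreads them.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<a href="http://example.com/"> A &#215; B</a>');
$text = $dom->getElementsByTagName('a')->item(0)->nodeValue;

// The entity was decoded to U+00D7, which is two bytes in UTF-8: c3 97.
echo bin2hex($text), "\n"; // 204120c3972042

// Misreading those UTF-8 bytes as Windows-1252 turns each byte into its
// own character, which reproduces the garbled text from the question.
$garbled = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
echo $garbled, "\n";
```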

Goutweed answered 28/8, 2011 at 11:55 Comment(3)

That's weird, because I was copying that response from the code. Anyway, I used utf8_encode and utf8_decode and it did the trick. Thanks – Palmerpalmerston 28/8, 2011 at 12:23

If you're viewing the response in a browser, it automatically tries to determine the charset. So if you want to see the actual output, you're better off viewing the page source. – Goutweed 28/8, 2011 at 12:25

Yeah, I meant I was viewing the page source with Chrome, and that's where I got what I pasted – Palmerpalmerston 28/8, 2011 at 14:8

I fixed my problem with broken entities by converting UTF-8 to UTF-8 with BOM.

Premillenarian answered 12/5, 2021 at 11:56 Comment(0)
