How to prevent DOMDocument from converting &nbsp; to Unicode

I am trying to get the inner HTML of a DOMElement in PHP. Example markup:

<div>...</div>
<div id="target"><p>Here's some &nbsp; <em>funny</em> &nbsp; text</p></div>
<div>...</div>
<div>...</div>

Feeding the above string into the variable $html, I am doing:

$doc = new DOMDocument();
@$doc->loadHTML("<html><body>$html</body></html>");
$node = $doc->getElementById('target');
$markup = '';
foreach ($node->childNodes as $child) {
  $markup .= $child->ownerDocument->saveXML($child);
}

The resulting $markup string looks like this (converted to JSON to reveal the invisible characters):

"<p>Here's some \u00a0 <em>funny<\/em> \u00a0 text<\/p>"

All &nbsp; entities have been converted to literal Unicode non-breaking spaces (U+00A0), which breaks my application.

In my ideal world, there would be a way to retrieve the original string of HTML inside the target div as-is, without DOMDocument doing anything to it at all. That doesn't seem to be possible, so the next best thing would be to somehow turn off this character conversion. So far I've tried:

  • Setting $doc->substituteEntities = false; with no result. Changing it to true doesn't help either.
  • Toggling $doc->preserveWhiteSpace with no change either way
  • Changing saveXML to saveHTML. Doesn't make a difference.

Finally I resorted to tacking on this hack, which works but doesn't feel like the right solution.

$markup = str_replace("\xc2\xa0", '&nbsp;', $markup);

Surely there is a better way?

Tremann answered 2/12, 2019 at 22:13 Comment(1)
“In my ideal world, there would be a way to retrieve the original string of html inside the target div as-is, without DOMDocument doing anything to it at all.” - you either want to work based on text, or a DOM. Once you work with DOM, you “surrender” your rights to demand anything be represented exactly the same way as some “source code” originally did. - Nonattendance

You can use the rather cryptic function mb_encode_numericentity() to convert only the characters outside the printable ASCII range into numeric entities, so it leaves your markup untouched:

<?php
$html = <<< HTML
<div>...</div>
<div id="target"><p>Here's some &nbsp; <em>funny 😂</em> &nbsp; text</p></div>
<div>...</div>
<div>...</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML("<html><head><meta charset=UTF-8></head><body>$html</body></html>");
$node = $doc->getElementById('target');
$markup = '';
foreach ($node->childNodes as $child) {
  $markup .= $child->ownerDocument->saveHTML($child);
}

$convmap = [
    0x00, 0x1f, 0, 0xff,
    0x7f, 0x10ffff, 0, 0xffffff,
];

$markup = mb_encode_numericentity($markup, $convmap, "UTF-8");

echo $markup;

Output:

<p>Here's some &#160; <em>funny &#128514;</em> &#160; text</p>
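
The conversion map is the least obvious part of the code above. As a minimal sketch of how it is read, assuming the mbstring extension is loaded: each group of four integers is a start code point, an end code point, an offset, and a mask. The single-range map and the sample string below are illustrative choices, not taken from the answer.

```php
<?php
// Each group of four integers in a conversion map means:
//   start code point, end code point, offset, mask
// A character inside [start, end] is written as "&#" . (($cp + offset) & mask) . ";".
// This illustrative map covers every non-ASCII code point and leaves the value unchanged.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Sample input: "é" (U+00E9) and a non-breaking space (U+00A0) next to plain markup.
$input   = "caf\u{00E9}\u{00A0}<em>ok</em>";
$encoded = mb_encode_numericentity($input, $convmap, 'UTF-8');

echo $encoded, PHP_EOL; // caf&#233;&#160;<em>ok</em>
```

Because the range starts at 0x80, ASCII characters such as < and > are never touched, which is why the tags pass through intact.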

This is outside the scope of the original question, but I've added an emoji to the string as well. For multibyte characters to survive, the <meta charset="UTF-8"> tag tells the parser to treat the content as UTF-8 instead of its default, ISO-8859-1.
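
If named entities such as &nbsp; are required rather than numeric ones, one possible approach is to translate characters through PHP's own entity table with the markup-significant characters removed. This is only a sketch and not part of the original answer; the helper name restoreNamedEntities is hypothetical, and characters without an HTML 4.01 named entity (the emoji, for example) are left as they are.

```php
<?php
// Hypothetical helper: replace characters that have an HTML 4.01 named
// entity with that entity, while leaving tags and existing entities alone.
function restoreNamedEntities(string $markup): string
{
    $table = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES | ENT_HTML401, 'UTF-8');

    // Remove the characters that are significant in markup so that
    // "<", ">", "&" and quotes are never rewritten.
    unset($table['&'], $table['<'], $table['>'], $table['"'], $table["'"]);

    return strtr($markup, $table);
}

// The non-breaking spaces are written as \u{00A0} so they are visible here.
$markup = "<p>Here's some \u{00A0} <em>funny</em> \u{00A0} text</p>";

echo restoreNamedEntities($markup), PHP_EOL;
// <p>Here's some &nbsp; <em>funny</em> &nbsp; text</p>
```

This generalises a single str_replace() call to every character in the table; whether that is desirable depends on which entities the application expects.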

Ita answered 22/1, 2020 at 21:11 Comment(5)
This is now deprecated in PHP 8 and gives this notice: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead. I don't believe any of the suggested alternatives will do the same job, so I've resorted to a simple str_replace instead. - Draughts
@Draughts I just tried this in 8.1.26 and got no such notice. Looks like it only becomes a problem in 8.2. - Ita
@Draughts answer is updated - Ita
Thanks for the update @Ita :) It's so much more complicated than a str_replace, lol, but admittedly far more complete in converting the UTF-8 characters back to HTML entities. Wouldn't it convert to numeric form though, rather than &nbsp; form? And yes, you are correct, it's PHP 8.2 where this use of mb_convert_encoding has been deprecated - it will be removed in PHP 9, I believe. - Draughts
@seb you’re right, will update the output shortly - Ita

I also ran into this issue; it is basically described here already.

The solutions provided there worked for me except for the &nbsp; character, which is why I came here. The solution provided by miken32 did not work for me when applied on saving, but converting the content when loading the HTML did. The solution is:

$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));

This solution is also described in the linked Stack Overflow question and in this blog post, which helped me solve the issue.
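
Since the 'HTML-ENTITIES' pseudo-encoding is deprecated in mbstring as of PHP 8.2, a load-time conversion can also be sketched with mb_encode_numericentity(). This is an assumption-laden variant rather than the code from this answer: the sample $content and the conversion map are illustrative, and it produces numeric entities instead of named ones, which makes no difference to the parser.

```php
<?php
// Assumption: $content is a UTF-8 HTML fragment; this value is only a sample.
$content = "<p id=\"t\">caf\u{00E9} \u{1F602}</p>";

// Encode every non-ASCII code point as a numeric entity, so the HTML
// parser receives pure ASCII and never has to guess the input encoding.
$ascii = mb_encode_numericentity($content, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of emitting them
$dom->loadHTML($ascii);
libxml_clear_errors();

// DOM strings come back as UTF-8, with the entities decoded again.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

echo $ascii, PHP_EOL; // <p id="t">caf&#233; &#128514;</p>
```

As the comment on this answer points out, this addresses Unicode being mangled on load, not the preservation of &nbsp; on output.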

Landis answered 13/5, 2022 at 15:35 Comment(1)
This question is about HTML that had entities in it, and the user wanted to preserve them instead of converting to Unicode. The question you've linked to is about the opposite problem; the user has Unicode in their HTML that isn't being preserved. - Ita
