DomDocument and special characters


This is my code:

$oDom = new DOMDocument();
$oDom->loadHTML("èàéìòù");
echo $oDom->saveHTML();

This is the output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>&Atilde;&uml;&Atilde;&nbsp;&Atilde;&copy;&Atilde;&not;&Atilde;&sup2;&Atilde;&sup1;</p></body></html>

I want this output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>èàéìòù</p></body></html>

I've tried with ...

$oDom = new DomDocument('4.0', 'UTF-8');

and with 1.0 and other values, but nothing worked.

One more thing: is there a way to get the same HTML back untouched? For example, given <p>hello!</p> as input, I would like to get exactly <p>hello!</p> as output, using DOMDocument only to parse the DOM and make some substitutions inside the tags.

Kopeisk answered 4/7, 2011 at 15:12 Comment(2)

given you've got Ã, in the output, something's mangling your UTF-8 and making it look like iso-8859 or similar. – Calico 4/7, 2011 at 15:27

possible duplicate of PHP DOMDocument loadHTML not encoding UTF-8 correctly – Dishtowel 11/2, 2013 at 10:18
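
The pattern in the output can be reproduced by hand. A minimal sketch, assuming the mbstring extension is available: the two UTF-8 bytes of è are read as two separate ISO-8859-1 characters, which is where &Atilde;&uml; in the question's output comes from. The variable names are illustrative only.

```php
<?php
// "è" written as explicit bytes so the example does not depend on the file's encoding.
$utf8 = "\xC3\xA8";

// Treat each byte as one ISO-8859-1 character and re-encode the result as UTF-8.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Escape the misread text the way an HTML serializer would.
$entities = htmlentities($misread, ENT_QUOTES, 'UTF-8');

echo bin2hex($utf8), "\n"; // c3a8
echo $entities, "\n";      // &Atilde;&uml;
```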

K

65

Solution:

$oDom = new DOMDocument();
$oDom->encoding = 'utf-8';
$oDom->loadHTML( utf8_decode( $sString ) ); // important!

$sHtml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';
$sHtml .= $oDom->saveHTML( $oDom->documentElement ); // important!

The saveHTML() method behaves differently when you pass it a node. You can pass the root node ($oDom->documentElement) and prepend the desired !DOCTYPE manually. The other important part is utf8_decode(). In my case, none of the other attributes and methods of the DOMDocument class produced the desired result.
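
A rough sketch of the approach described above, not a definitive implementation. It assumes the input only contains characters that exist in ISO-8859-1, and it uses mb_convert_encoding() (mbstring extension) as a stand-in for utf8_decode(), which is deprecated as of PHP 8.2. The sample string and the short DOCTYPE are placeholders chosen for illustration.

```php
<?php
// Sample input (assumption): UTF-8 text limited to the ISO-8859-1 range.
$sString = "<p>\u{E8}\u{E0}\u{E9}\u{EC}\u{F2}\u{F9}</p>"; // èàéìòù

$oDom = new DOMDocument();
// Hand the parser ISO-8859-1 bytes, matching the fallback seen in the question.
$oDom->loadHTML(mb_convert_encoding($sString, 'ISO-8859-1', 'UTF-8'));

// The DOM exposes text as UTF-8, so the paragraph holds the original characters.
$sText = $oDom->getElementsByTagName('p')->item(0)->textContent;

// Serialize only the root element and prepend a DOCTYPE by hand.
$sHtml = "<!DOCTYPE html>\n" . $oDom->saveHTML($oDom->documentElement);
echo $sHtml, "\n";
```

Checking $sText separately from $sHtml makes it easier to tell whether a problem comes from parsing or from serialization.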

Kopeisk answered 8/7, 2011 at 6:11 Comment(2)

To make this work with characters outside the ISO-8859-1 set, you need to use multi-byte conversion, so that characters like Chinese or the euro sign will also be properly encoded. $oDom->loadHTML(mb_convert_encoding($sString, 'HTML-ENTITIES', 'UTF-8')); see here for more info – Somali 16/7, 2015 at 19:58

I almost lost my mind trying to solve this! Thank you very much! – Attitudinize 31/5, 2021 at 18:56

C

7

Try to set the encoding type after you have loaded the HTML.

$dom = new DOMDocument();
$dom->loadHTML($data);
$dom->encoding = 'utf-8';
echo $dom->saveHTML();

Other way

Cumulonimbus answered 4/7, 2011 at 15:32 Comment(0)

L

7

$dom = new DomDocument();
$str = htmlentities($str);
$dom->loadHTML(utf8_decode($str));
$dom->encoding = 'utf-8';
.
.
.
$str = $dom->saveHTML();
$str = html_entity_decode($str);

The above code worked for me.

Lowelllowenstein answered 28/2, 2020 at 7:34 Comment(0)

N

6

I don't know why the marked answer didn't work for my problem. But this one did.

ref: https://www.php.net/manual/en/class.domdocument.php

<?php

// bail out early if the content is empty, to avoid the loadHTML() warning
if ( empty( $content ) ) {
    return false;
}

// converts non-ASCII UTF-8 characters to HTML entities, so the parser only sees ASCII
$content = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8');

// creating new document
$doc = new DOMDocument('1.0', 'utf-8');

// collect libxml parse errors internally instead of emitting PHP warnings
libxml_use_internal_errors(true);

// loads the content without adding the enclosing html/body tags or the doctype declaration
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// do whatever you want to do with this code now

?>
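
The two flags also bear on the second part of the question, getting a fragment back without the added wrappers. A minimal sketch using the <p>hello!</p> sample from the question; it assumes a libxml version that supports both flags. Fragments with more than one top-level element may be restructured by the parser, so treat this as an illustration for the single-element case only.

```php
<?php
$content = '<p>hello!</p>'; // sample fragment from the question

$doc = new DOMDocument();
libxml_use_internal_errors(true);
// Skip the implied <html>/<body> wrappers and the default DOCTYPE.
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// saveHTML() appends a trailing newline, so trim before comparing.
$out = trim($doc->saveHTML());
echo $out, "\n"; // <p>hello!</p>
```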

Nessim answered 9/10, 2019 at 3:38 Comment(0)

E

5

The issue appears to be known, according to the user comments on the manual page at php.net. Solutions suggested there include putting

<meta http-equiv="content-type" content="text/html; charset=utf-8">

in the document before you put any strings with non-ASCII chars in.

Another hack suggests putting

<?xml encoding="UTF-8">

as the first text in the document and then removing it at the end.

Nasty stuff. Smells like a bug to me.
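
A small sketch of the first suggestion, under the assumption that the input is valid UTF-8. It prepends the meta tag and then checks the parsed text rather than the serialized output, which shows whether the parser read the bytes correctly. The sample string and variable names are illustrative.

```php
<?php
$expected = "\u{E8}\u{E0}\u{E9}\u{EC}\u{F2}\u{F9}"; // èàéìòù as UTF-8
$text = '<p>' . $expected . '</p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
// Declare the charset before any non-ASCII text reaches the parser.
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $text);
libxml_clear_errors();

// Text read back from the DOM is UTF-8, so it can be compared directly.
$parsed = $dom->getElementsByTagName('p')->item(0)->textContent;
var_dump($parsed === $expected); // bool(true)
```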

Encrust answered 6/7, 2011 at 12:3 Comment(0)

R

4

This way:

/**
 * @param string $text
 * @return DOMDocument
 */
private function buildDocument($text)
{
    $dom = new DOMDocument();

    libxml_use_internal_errors(true);
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $text);
    libxml_use_internal_errors(false);

    return $dom;
}

Rolanda answered 31/10, 2018 at 12:0 Comment(1)

I needed it for an API endpoint that a mobile app uses. And only this solution worked for me. Thanks :) – Tailpiece 10/7, 2019 at 12:56

A

3

What worked for me was:

$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));

credit: https://davidwalsh.name/domdocument-utf8-problem
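
Note that the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2. One possible substitute, sketched here under the assumption that the mbstring extension is available, is mb_encode_numericentity(). The conversion map below is my own choice, meant to cover every code point above ASCII; the sample string is illustrative.

```php
<?php
$content = "<p>\u{20AC} \u{4E2D}</p>"; // euro sign and a CJK character, as UTF-8

// Map every code point from U+0080 to U+10FFFF to a decimal numeric entity.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($content, $map, 'UTF-8');

$doc = new DOMDocument();
$doc->loadHTML($encoded); // the input is now pure ASCII

// The parser resolves the entities, so the DOM holds the original characters.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $encoded, "\n"; // <p>&#8364; &#20013;</p>
```

Because the markup itself is ASCII, the tags pass through unchanged and only the non-ASCII text is rewritten.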

Acriflavine answered 20/3, 2020 at 7:11 Comment(1)

That fixed my issue with Turkish characters. – Intercommunion 26/2, 2022 at 21:44

M

1

None of the above worked for me but this one did the job:

$fileContent = file_get_contents('my_file.html');
$dom = new DOMDocument();
@$dom->loadHTML(mb_convert_encoding($fileContent, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$dom->encoding = 'utf-8';
$html = $dom->saveHTML();
$html = html_entity_decode($html, ENT_COMPAT, 'UTF-8');
echo $html;

Musical answered 22/4, 2021 at 8:40 Comment(0)

K

0

Looks like you just need to set substituteEntities when you create the DOMDocument object.

Knowledge answered 4/7, 2011 at 15:15 Comment(0)

B

0

This worked for me:

<?php

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}

?>

Credits: https://www.php.net/manual/en/domdocument.loadhtml.php#95251

Bequest answered 28/2, 2023 at 21:34 Comment(0)
