If you want to output UTF-8 with DOMDocument, you need to specify that. Simple, isn't it? If you already smell a trick question, you're not too far off, but at first sight it really is straightforward.
Consider the following (UTF-8 encoded) code example that outputs hexadecimal entities:
$dom = new DOMDocument();
$dom->loadXML('<root>ירושלים</root>');
$dom->save('php://output');
Output:
<?xml version="1.0"?>
<root>&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD;</root>
As written, if you want to output this as UTF-8, you need to specify it, and it is straightforward:
...
$dom->encoding = 'UTF-8';
$dom->save('php://output');
The output then is in UTF-8 explicitly:
<?xml version="1.0" encoding="UTF-8"?>
<root>ירושלים</root>
So much for the straightforward part. If you are interested in the dirty little details, you are free to read on - if not, please do not ask "why?" :).
I just wrote "in UTF-8 explicitly" because the output of the first example is UTF-8 encoded as well. The XML merely contained hexadecimal entities, which is perfectly valid - even in UTF-8!
You may have noticed that I am starting to nitpick here, but remember: UTF-8 is the default encoding of XML.
And if you now start to say: Hey wait, if the default encoding is UTF-8 anyway, why does PHP's DOMDocument use the entities in the first place?
Well, the truth is that, contrary to the finding in the question, it does not. Not always.
See the following example, which uses an XML comment instead of a node value to hold the Hebrew (Ivrit) letters:
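If you would rather check this than take my word for it, here is a minimal sketch. It assumes the script file itself is saved as UTF-8, and the variable names are mine, purely for illustration. It serializes the same document once with the default settings and once with the encoding set, parses both results again and compares the character data they carry.

```php
<?php
// Serialize the same document twice: once with the default settings,
// once with the encoding set explicitly.
$dom = new DOMDocument();
$dom->loadXML('<root>ירושלים</root>');
$withDefaults = $dom->saveXML();

$dom->encoding = 'UTF-8';
$explicitUtf8 = $dom->saveXML();

// Parse both serializations again and compare the text they contain.
$a = new DOMDocument();
$a->loadXML($withDefaults);
$b = new DOMDocument();
$b->loadXML($explicitUtf8);
var_dump($a->documentElement->textContent === $b->documentElement->textContent);

// An empty pattern with the /u modifier only matches valid UTF-8 subjects.
var_dump(preg_match('//u', $withDefaults) === 1);
```

Both checks should come out as true: the two serializations differ in how the characters are written, not in what the document contains.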
$dom = new DOMDocument();
$dom->loadXML('<root><!-- ירושלים --></root>');
$dom->save('php://output');
Output:
<?xml version="1.0"?>
<root><!-- ירושלים --></root>
Okay, all clear? So the dirty little secret here is: whether you've got those XML entities in there or not makes no difference for the document; it is just a different way of writing the same XML character data. And you probably already feel invited: let's try CDATA instead for the first example:
$dom = new DOMDocument();
$dom->loadXML("<root><![CDATA[ירושלים]]></root>");
$dom->save('php://output');
Output:
<?xml version="1.0"?>
<root><![CDATA[ירושלים]]></root>
As this demonstrates, just like the XML comment example before, no XML entities are used here. Well, they would not work anyway: inside a CDATA section, just as inside a comment, an entity reference is not recognized and is taken as literal text.
For an overview, let's create an example that contains all of these:
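To see why entities are no option inside comments and CDATA sections, here is a small sketch of my own (not from the question). It puts the same hexadecimal reference into a comment, a CDATA section and ordinary text, then prints what the parser made of each.

```php
<?php
// The same reference three times: in a comment, in CDATA, and in plain text.
$dom = new DOMDocument();
$dom->loadXML('<root><!-- &#x5D9; --><![CDATA[&#x5D9;]]>&#x5D9;</root>');

// Print the node type name and the value the parser stored for it.
foreach ($dom->documentElement->childNodes as $node) {
    printf("%-15s [%s]\n", $node->nodeName, $node->nodeValue);
}
```

Only the text node should come back as the actual letter. In the comment and in the CDATA section the reference stays the literal characters it was written with, which is why the serializer has to write those characters out directly.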
$dom = new DOMDocument();
$dom->loadXML("<!-- ירושלים --><root>ירושלים <![CDATA[ירושלים]]></root>");
$dom->save('php://output');
Output:
<?xml version="1.0"?>
<!-- ירושלים -->
<root>&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD; <![CDATA[ירושלים]]></root>
Lessons learned:
- UTF-8 is always used. Entities are merely used for the non-ASCII characters in PCDATA unless the UTF-8 encoding is specified explicitly. If an encoding other than UTF-8 is specified, different rules apply.
- You cannot control whether entities are used for output merely by loading an XML document as a UTF-8 encoded string into PHP's DOMDocument. Neither libxml flags nor a BOM change that. [1]
- You can specify that you do not want to use entities by setting the document's encoding to UTF-8.
- If you are able to, you can manipulate the input string so that it has an XML declaration specifying the document's encoding, as outlined in gordon's answer.
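The last point can be sketched as follows. This is only my illustration of the idea of putting an XML declaration in front of the string, not gordon's code. It assumes the string really is UTF-8 encoded and does not carry a declaration yet.

```php
<?php
// Assumption: $xml is UTF-8 encoded and has no XML declaration of its own.
$xml = '<root>ירושלים</root>';

// Prepend a declaration that names the encoding before loading.
$declared = '<?xml version="1.0" encoding="UTF-8"?>' . $xml;

$dom = new DOMDocument();
$dom->loadXML($declared);

// The document takes the encoding over from the declaration ...
var_dump($dom->encoding);
// ... and uses it again when serializing.
echo $dom->saveXML();
```

With the declaration in place the document keeps UTF-8 as its encoding, so saving it should give the same result as setting the encoding property by hand.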
Tip: If your string has an XML declaration that does not match the string's encoding, or you want to change either of the two before loading the string into DOMDocument, you need to change the XML declaration and/or re-encode the string. This has been covered in an answer to the question "PHP XMLReader, get the version and encoding", which shows how the XMLRecoder class works.
And that's it, hopefully.
[1] Probably it works if you load from an HTTP request, provide a stream context and flag the character encoding via meta-data, but this should be tested first; I do not know. That the BOM does not work is somewhat of a sign that none of these things work.
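To give an idea of what such a step involves, here is a rough sketch. Note that this is not the XMLRecoder class from that answer: the function name xml_string_to_utf8 is hypothetical, it relies on the iconv extension, and it only covers the simple case of an ASCII-compatible source encoding with the declaration at the very start of the string (no BOM).

```php
<?php
/**
 * Hypothetical helper: re-encode an XML string to UTF-8 and make the
 * encoding named in its XML declaration agree with the new bytes.
 */
function xml_string_to_utf8(string $xml, string $fromEncoding): string
{
    // Re-encode the bytes first.
    $utf8 = iconv($fromEncoding, 'UTF-8', $xml);
    if ($utf8 === false) {
        throw new InvalidArgumentException("Cannot convert from $fromEncoding");
    }

    // Then rewrite the encoding in an existing declaration, keeping its quotes.
    $pattern = '/^(<\?xml\s[^>]*?encoding\s*=\s*)(["\'])[^"\']*\2/';
    return preg_replace($pattern, '${1}${2}UTF-8${2}', $utf8, 1) ?? $utf8;
}

// ISO-8859-1 input; the accented letter is the single byte 0xE9.
$latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root>caf\xE9</root>";

$dom = new DOMDocument();
$dom->loadXML(xml_string_to_utf8($latin1, 'ISO-8859-1'));
echo $dom->saveXML();
```

Both steps belong together: re-encoding without touching the declaration (or the other way round) produces exactly the mismatch described above.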