THE PROBLEM: I need an XML file "fully encoded" in UTF-8; that is, with no entities representing symbols: all symbols encoded as UTF-8, except the only 3 that are XML-reserved, "&" (amp), "<" (lt) and ">" (gt). And I need a built-in function that does it fast: one that transforms entities into real UTF-8 characters (without corrupting my XML).
PS: it is a "real world problem" (!); PMC/journals, for example, has 2.8 MILLION scientific articles encoded with a special XML DTD (also known as the JATS format)... To process them as "usual XML-UTF8-text" we need to change numeric entities into UTF-8 characters.
THE ATTEMPTED SOLUTION: the natural function for this task is html_entity_decode, but it destroys the XML code (!), because it also transforms the 3 XML-reserved symbols.
Illustrating the problem
Suppose
$xmlFrag ='<p>Hello world! &#160;&#160;&#160;Let A&lt;B and A=&#x222C;dxdy</p>';
Where the entities 160 (nbsp) and x222C (double integral) must be transformed into UTF-8, and the XML-reserved lt must not. The XML text will be (after the transformation),
$xmlFrag = '<p>Hello world!    Let A&lt;B and A=∬dxdy</p>';
The text "A<B" needs an XML-reserved character, so it MUST stay as A&lt;B.
Frustrated solutions
I tried to use html_entity_decode to solve the problem (directly!)... So I updated my PHP to v5.5 to try the ENT_XML1 option,
$s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8'); // not working as I expected
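To make the failure concrete, here is a minimal sketch (the input string is only an illustrative assumption, not taken from my real data). It shows that ENT_XML1 decodes the numeric entity as wanted, but also decodes the reserved &lt;, which leaves a bare "<" inside the text:

```php
<?php
// Illustrative input: one XML-reserved entity and one numeric entity.
$in = '<p>A&lt;B&#x222C;</p>';

// ENT_XML1 decodes the numeric entity, but it decodes &lt; as well.
$out = html_entity_decode($in, ENT_XML1, 'UTF-8');

echo $out, "\n"; // <p>A<B∬</p>  (the bare "<" makes the fragment ill-formed XML)
```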
Perhaps another question is, "WHY is there no other option that does what I expected?" -- it is important for many other XML applications (!), not only for me.
I do not need a workaround as an answer... OK, I show my ugly function; perhaps it helps you to understand the problem,
function xml_entity_decode($s) {
    // Illustration (as a user-defined function) of how the
    // hypothetical PHP built-in function MUST work.
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    // Protect the 3 XML-reserved entities with placeholders before decoding.
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s);
    //$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any PHP version
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+
    // Restore the protected entities.
    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
} // You see? No benchmark is needed:
// it is not as fast as a direct use of html_entity_decode; an
// XML-safe option would be ideal.
PS: corrected after this answer. The ENT_HTML5 flag must be used to convert really all named entities.
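For comparison only, the same idea can be sketched in a single pass with preg_replace_callback. This is still a user-land workaround, not the built-in I am asking for, and I make no claim about its speed. The name xml_entity_decode_sketch is hypothetical. Unlike the placeholder approach, it also leaves numeric forms of the reserved characters (such as &#60;) untouched:

```php
<?php
// Hypothetical helper (NOT a PHP built-in): decode every entity except
// those whose decoded form would contain an XML-reserved character.
function xml_entity_decode_sketch($s) {
    return preg_replace_callback(
        // Matches decimal, hexadecimal and named entity references.
        '/&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);/',
        function ($m) {
            $decoded = html_entity_decode($m[0], ENT_NOQUOTES | ENT_HTML5, 'UTF-8');
            // Keep the original entity if decoding would yield &, < or >
            // (this also covers unknown entities, which come back unchanged).
            return strpbrk($decoded, '&<>') !== false ? $m[0] : $decoded;
        },
        $s
    );
}

$xmlFrag = '<p>Hello world! &#160;Let A&lt;B and A=&#x222C;dxdy</p>';
echo xml_entity_decode_sketch($xmlFrag), "\n";
```

With ENT_NOQUOTES the quote entities are left as they are, so attribute values are not touched.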
html_entity_decode does what I'd expect it to do, given your input - hence why I think the issue is why you think you need to decode it? – Mccallister
html_entity_decode, is about "where is the PHP built-in function that does this?"... So, html_entity_decode was my guess, and I showed how frustrating it is to try to use it in that context. I edited the question (check if the introduction is better) to emphasize the problem; sorry for my difficulty in expressing it in English. PS: perhaps there is no such built-in function, so my dream is to see PHP 5.6's html_entity_decode with an option to do this simple and important task. – Fredel
xml_entity_decode() works fine and needs 1/10 of the time of the non-native workaround... REPEATING: the problem here is not my function, it is the absence of a PHP built-in function/parameter that solves the problem. – Fredel