Does PHP have a function for XML-safe entity decoding? Is there some xml_entity_decode?
THE PROBLEM: I need an XML file "fully encoded" in UTF-8; that is, with no entities representing symbols, all symbols encoded as UTF-8, except the only 3 that are XML-reserved: "&" (amp), "<" (lt) and ">" (gt). And I need a built-in function that does it fast: one that transforms entities into real UTF-8 characters (without corrupting my XML).
  PS: it is a "real world problem" (!); PMC/journals, for example, has 2.8 MILLION scientific articles encoded with a special XML DTD (also known as the JATS format)... To process them as "usual XML-UTF8-text" we need to change numeric entities into UTF-8 characters.

THE ATTEMPTED SOLUTION: the natural function for this task is html_entity_decode, but it destroys the XML code (!) by also transforming the 3 XML-reserved symbols.

Illustrating the problem

Suppose

  $xmlFrag ='<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';

Where the entities 160 (nbsp) and x222C (double integral) must be transformed into UTF-8, and the XML-reserved lt must not. The XML text will be (after transformation),

  $xmlFrag = '<p>Hello world!    Let A&lt;B and A=∬dxdy</p>';

The text "A<B" needs an XML-reserved character, so it MUST stay as A&lt;B.


Frustrated solutions

I tried to use html_entity_decode to solve the problem directly... So I updated my PHP to v5.5 to try the ENT_XML1 option,

  $s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8'); // not working
                                                        // as I expected
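
To make the failure concrete, here is a small check on the same fragment. It is only a sketch, assuming PHP 7+ (for the "\u{...}" string escape); the variable names are illustrative.

```php
<?php
// Demonstration only: the fragment from the question, decoded with ENT_XML1.
$xmlFrag = '<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';
$decoded = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8');

// The numeric entities become UTF-8 characters, as wanted...
var_dump(strpos($decoded, "\u{222C}") !== false); // bool(true)
// ...but &lt; is decoded too, leaving a bare "<" inside the text node.
var_dump(strpos($decoded, 'A<B') !== false);      // bool(true)
```

The second check is the corruption described above: the result is no longer well-formed XML.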

Perhaps another question is, "WHY is there no other option that does what I expected?" -- it is important for many other XML applications (!), not only for me.


I do not need a workaround as an answer... OK, I show my ugly function; perhaps it helps you to understand the problem,

  function xml_entity_decode($s) {
    // An illustration (as a user-defined function) of how the
    // hypothetical built-in PHP function MUST work
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s); 

    //$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any PHP version
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+

    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
  }  // You see? No benchmark is needed:
     //  it is not as fast as a direct call to html_entity_decode; an
     //  XML-safe option there would be ideal.

PS: corrected after this answer. The ENT_HTML5 flag is required to really convert all named entities.

Fredel answered 4/8, 2013 at 4:38 Comment(7)
Your XML fragment there is already well formed XML - why are you trying to decode it? It looks like you're trying to solve a different problem to the one you have.Mccallister
I need a fast built-in function, perhaps html_entity_decode() without bugs, and I illustrated the function with a user-defined function.Fredel
html_entity_decode does what I'd expect it to do, given your input - hence why I think the issue is why you think you need to decode it?Mccallister
@RowlandShaw, the question is not directly about html_entity_decode; it is about "where is the built-in PHP function that does this?"... So html_entity_decode was my guess, and I showed how frustrating it is to try to use it in that context. I edited the question (check if the introduction is better) to emphasize the problem; sorry for my difficulty expressing it in English. PS: perhaps there is no such built-in function, so my dream is to see PHP 5.6's html_entity_decode with an option to do this simple and important task.Fredel
So it sounds like you want the method to transform the XML to something semantically identical, but without using entities where possible? In which case, I suspect that the method isn't there, as it shouldn't be needed - any XML parser reading the XML should treat your two fragments exactly the same (assuming the UTF-8 encoding doesn't get mangled/misrepresented on the way)Mccallister
Yes, that is it: "to transform the XML to something semantically identical, but without using entities where possible". But, about utility, see the question: I MUST save (or interchange) the file as UTF-8; it is not for an "expert tool that has its own internal DOM representation and loads anything". It is a real problem and a real limitation of PHP.Fredel
Pay attention: as commented here, my solution xml_entity_decode() works fine and needs 1/10 of the time of the non-native workaround... REPEATING: the problem here is not my function, it is the absence of a built-in PHP function/parameter that solves the problem.Fredel
F
7

This question keeps attracting, from time to time, a "false answer" (see the answers). This is perhaps because people do not pay attention, and because there is NO ANSWER: PHP lacks a built-in solution.

... So, let me repeat my workaround (which is NOT an answer!) so as not to create more confusion:

The best workaround

Pay attention:

  1. The function xml_entity_decode() below is the best workaround (better than any other).
  2. The function below is not an answer to the present question; it is only a workaround.

  function xml_entity_decode($s) {
    // illustrating how a (hypothetical) built-in PHP function MUST work
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s); 
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+
    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
  }

To test and demonstrate that you have a better solution, please test first with this simple benchmark:

  $countBchMk_MAX=1000;
  $xml = file_get_contents('sample1.xml'); // BIG and complex XML string
  $start_time = microtime(TRUE);
  for($countBchMk=0; $countBchMk<$countBchMk_MAX; $countBchMk++){

    $A = xml_entity_decode($xml); // 0.0002

    /* 0.0014
     $doc = new DOMDocument;
     $doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
     $doc->encoding = 'UTF-8';
     $A = $doc->saveXML();
    */

  }
  $end_time = microtime(TRUE);
  echo "\n<h1>END $countBchMk_MAX BENCHMARKs WITH ",
     ($end_time  - $start_time)/$countBchMk_MAX, 
     " seconds</h1>";
  
Fredel answered 4/8, 2013 at 4:38 Comment(0)
C
2

Use the DTD when loading the JATS XML document, as it will define any mapping from named entities to Unicode characters, then set the encoding to UTF-8 when saving:

$doc = new DOMDocument;
$doc->load($inputFile, LIBXML_DTDLOAD | LIBXML_NOENT);
$doc->encoding = 'UTF-8';
$doc->save($outputFile);
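
For a quick check without files or a DTD, the same round trip can be tried on a string that contains only numeric entities. This is a sketch, assuming the DOM extension is loaded; the fragment is the one from the question, and named entities such as &nbsp; would still need a DTD.

```php
<?php
// Sketch: numeric entities only, so no DTD and no libxml flags are used here.
$xml = '<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';

$doc = new DOMDocument;
$doc->loadXML($xml);
$doc->encoding = 'UTF-8'; // ask the serializer for UTF-8 output
$out = $doc->saveXML();   // includes the XML declaration

echo $out;
```

In the output the reserved &lt; stays escaped while the numeric entities come back as UTF-8 characters.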
Calvano answered 21/11, 2013 at 14:46 Comment(6)
Yes, it is not a solution for my problem (I do not have the DTDs), but it is a good solution for people who are working with complete XML+DTD sets... and do not need performance. My problem is (as stated) "I need a XML file full encoded by UTF8 ... and, I need a build-in function that do it fast". Do you have a benchmark comparing the performance of your load/save workaround with my xml_entity_decode()?Fredel
Hello... 1 year without any benchmark? OK, I did one: my xml_entity_decode() needs 0.0002 seconds to convert a big XML string. Your "loadXML() and saveXML()" needs 0.0014 seconds to convert the same XML string. So your solution needs ~10 times more than xml_entity_decode()... So it is not a solution (!).Fredel
@PeterKrauss As long as there are no unusual named entities in the XML, you can leave out the libxml flags and not load the DTD (although if it's JATS XML you're working with, you probably do want to load the DTD, even if it makes things slower). The important part is to add $doc->encoding = 'UTF-8'; before saving the XML. Does that make the benchmark more acceptable?Calvano
@AlfEaton, please test your assertion on a simple fragment with a named character entity, such as &nbsp;. Without a DTD, loadXML() raises an error... So it is not a solution to the described problem.Fredel
@PeterKrauss The described problem doesn't include &nbsp;. If you're using named character entities, then you need to load a DTD that maps them to Unicode codepoints.Calvano
@AlfEaton, sorry, you are correct, it was only a PS comment in my description, "Must be ENT_HTML5 flag, for convert really all named entities"... Well, your solution is good but, even for this restricted scope, its performance is bad, and there is nothing new in PHP 7... I think it is a case for submitting a PHP RFC.Fredel
G
2

I had the same problem because someone used HTML templates to create XML instead of using SimpleXML. sigh... Anyway, I came up with the following. It's not as fast as yours, but it's not an order of magnitude slower, and it is less hacky. Yours will inadvertently convert #_x_amp#; to &amp;, however unlikely its presence in the source XML.

Note: I'm assuming default encoding is UTF-8

// Search for named entities (strings like "&abc1;").
echo preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

/* <Foo>€&amp;foo Ç</Foo> */

Also, if you want to replace special characters with numbered entities (in case you don't want a UTF-8 XML), you can easily add a function to the above code:

// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

/* <Foo>&#8364;&amp;foo &#199;</Foo> */

In your case you want it the other way around: decode numbered entities to UTF-8:

// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

// Decodes the (so far uncaught) numbered entities to UTF-8.
echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

/* <Foo>€&amp;foo Ç</Foo> */
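
The two steps above (named entities through the callback, then numeric entities) could also be folded into one pass. The following is only a sketch, not a PHP built-in: the function name xml_entity_decode_single_pass is hypothetical, and it assumes PHP 7+ with ENT_HTML5 as the entity table. It decodes each entity on its own and re-escapes &, < and > whenever the decoding produces them, so &#60; or &#38; cannot leak into the XML either.

```php
<?php
// Hypothetical helper (not a PHP built-in): decode every entity in one pass,
// then re-escape the three XML-reserved characters if the decoding produced them.
function xml_entity_decode_single_pass(string $s): string
{
    // Decimal, hexadecimal and named entities.
    $pattern = '/&(?:#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i';

    return preg_replace_callback($pattern, function (array $m): string {
        $decoded = html_entity_decode($m[0], ENT_HTML5 | ENT_NOQUOTES, 'UTF-8');
        if ($decoded === $m[0]) {
            return $m[0]; // unknown entity, or a quote entity: leave untouched
        }
        // "&amp;", "&#38;", "&lt;", "&#60;" ... come back in their XML-safe form.
        return strtr($decoded, ['&' => '&amp;', '<' => '&lt;', '>' => '&gt;']);
    }, $s);
}

echo xml_entity_decode_single_pass('<Foo>&euro;&amp;foo &Ccedil; #_x_amp#; &#8748; A&#60;B</Foo>'), "\n";
// <Foo>€&amp;foo Ç #_x_amp#; ∬ A&lt;B</Foo>
```

Because nothing is swapped for placeholders, a literal #_x_amp#; in the input passes through unchanged.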

Benchmark

I've added a benchmark for good measure. This also demonstrates the flaw in your solution for clarity. Below is the input string I used.

<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>

Your method

php -r '$q=["&amp;","&gt;","&lt;"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>€&amp;foo Ç é &amp; ∬</Foo>
=====
Time taken: 2.0397531986237

My method

php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>€&amp;foo Ç é #_x_amp#; &#8748;</Foo>
=====
Time taken: 4.045273065567

My method (with unicode to numbered entity):

php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>&#8364;&amp;foo &#199; &#233; #_x_amp#; &#8748;</Foo>
=====
Time taken: 5.4407880306244

My method (with numbered entity to unicode):

php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>€&amp;foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 5.5400078296661
Gelignite answered 15/2, 2016 at 14:1 Comment(0)
C
1
    public function entity_decode($str, $charset = NULL)
{
    if (strpos($str, '&') === FALSE)
    {
        return $str;
    }

    static $_entities;

    isset($charset) OR $charset = $this->charset;
    $flag = is_php('5.4')
        ? ENT_COMPAT | ENT_HTML5
        : ENT_COMPAT;

    do
    {
        $str_compare = $str;

        // Decode standard entities, avoiding false positives
        if ($c = preg_match_all('/&[a-z]{2,}(?![a-z;])/i', $str, $matches))
        {
            if ( ! isset($_entities))
            {
                $_entities = array_map('strtolower', get_html_translation_table(HTML_ENTITIES, $flag, $charset));

                // If we're not on PHP 5.4+, add the possibly dangerous HTML 5
                // entities to the array manually
                if ($flag === ENT_COMPAT)
                {
                    $_entities[':'] = '&colon;';
                    $_entities['('] = '&lpar;';
                    $_entities[')'] = '&rpar;';
                    $_entities["\n"] = '&newline;';
                    $_entities["\t"] = '&tab;';
                }
            }

            $replace = array();
            $matches = array_unique(array_map('strtolower', $matches[0]));
            // array_unique() keeps the original keys, so iterate over the values
            // instead of counting from 0 to $c.
            foreach ($matches as $match)
            {
                if (($char = array_search($match.';', $_entities, TRUE)) !== FALSE)
                {
                    $replace[$match] = $char;
                }
            }

            $str = str_ireplace(array_keys($replace), array_values($replace), $str);
        }

        // Decode numeric & UTF16 two byte entities
        $str = html_entity_decode(
            preg_replace('/(&#(?:x0*[0-9a-f]{2,5}(?![0-9a-f;]))|(?:0*\d{2,4}(?![0-9;])))/iS', '$1;', $str),
            $flag,
            $charset
        );
    }
    while ($str_compare !== $str);
    return $str;
}
Chloechloette answered 9/11, 2014 at 11:51 Comment(2)
Please see and analyse my xml_entity_decode() function; it is OK (!). The problem is not my function, which works; the problem is PHP (where is the "native/built-in function"?). About your function: if the behaviour of your xml_convert() is not exactly the same, it is wrong: please check your function and correct it if necessary... And then say how it differs from my xml_entity_decode().Fredel
Excuse me, Peter, for my oops! I have edited my answer. It is a replacement for html_entity_decode(); I hope it is useful. Although it is not technically correct to leave out the semicolon at the end of an entity, most browsers will still interpret the entity correctly. html_entity_decode() does not convert entities without semicolons, so this function includes a small fix for that, which may be helpful for your challenge.Chloechloette
C
0

For those coming here because a numeric entity in the range 128 to 159 remains a numeric entity instead of being converted to a character:

echo xml_entity_decode('&#128;');
// Outputs &#128; instead of the expected €

This depends on the PHP version (at least for PHP >= 5.6 the entity remains) and on the affected characters. The reason is that code points 128 to 159 are control characters, not printable characters, in Unicode. This can happen if the data to be converted mixes in windows-1252 content (where byte 128 is the € sign).
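
If the entities in that range really are Windows-1252 byte values, they could be mapped before the decoder is called. A minimal sketch, assuming the mbstring extension is available; the function name decode_cp1252_numeric_entities is hypothetical, only decimal entities are handled, and a few values in the range (129, for example) have no Windows-1252 character, so what they produce depends on mbstring.

```php
<?php
// Hypothetical helper: treat &#128; .. &#159; as Windows-1252 bytes and
// replace them with the corresponding UTF-8 characters.
function decode_cp1252_numeric_entities(string $s): string
{
    // 128-129, 130-159 written out as a decimal pattern.
    $pattern = '/&#(1(?:2[89]|[3-5][0-9]));/';

    return preg_replace_callback($pattern, function (array $m): string {
        return mb_convert_encoding(chr((int) $m[1]), 'UTF-8', 'Windows-1252');
    }, $s);
}

echo decode_cp1252_numeric_entities('Price: 10 &#128;'), "\n"; // Price: 10 €
```

Entities outside 128 to 159 are left alone, so the result can still be passed to xml_entity_decode() afterwards.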

Chabazite answered 18/12, 2019 at 11:29 Comment(1)
see also #9588251Chabazite
S
-1

Try this function:

function xmlsafe($s,$intoQuotes=1) {
if ($intoQuotes)
     return str_replace(array('&','>','<','"'), array('&amp;','&gt;','&lt;','&quot;'), $s);
else
     return str_replace(array('&','>','<'), array('&amp;','&gt;','&lt;'), html_entity_decode($s));
}

example usage:

echo '<k nid="'.$node->nid.'" description="'.xmlsafe($description).'"/>';

also: https://mcmap.net/q/382633/-is-htmlentities-sufficient-for-creating-xml-safe-values

This code is used in production, and no problems seem to have occurred with UTF-8.

Southeastwards answered 29/10, 2013 at 14:42 Comment(1)
No: see my "Illustrating the problem" (at the question text) and try your xmlsafe() with my $xmlFrag.Fredel
