htmlentities in PHP but preserving html tags

I want to convert all text in a string into HTML entities while preserving the HTML tags. For example, this:

<p><font style="color:#FF0000">Camión español</font></p>

should be translated into this:

<p><font style="color:#FF0000">Cami&oacute;n espa&ntilde;ol</font></p>

Any ideas?

Mescaline answered 1/9, 2009 at 22:19 Comment(2)
Actually I'd say it's the wrong question. Why do you want to escape those characters? - Laurustinus
There could be a use for such a thing (I like Peter's answer), but asking it makes me immediately suspect the OP has a character encoding mismatch problem (usually UTF-8 vs ISO-8859-1) that should be fixed in preference to trying to hide the brokenness behind entity-reference-escaping the small-in-comparison-to-Unicode selection of characters that have entities defined in HTML. - Ultramarine
M
68

You can get the list of character => entity correspondences used by htmlentities with the function get_html_translation_table. Consider this code:

$list = get_html_translation_table(HTML_ENTITIES);
var_dump($list);

(You might want to check the second parameter to that function in the manual -- maybe you'll need to set it to a value different than the default one)

It will get you something like this :

array
  ' ' => string '&nbsp;' (length=6)
  '¡' => string '&iexcl;' (length=7)
  '¢' => string '&cent;' (length=6)
  '£' => string '&pound;' (length=7)
  '¤' => string '&curren;' (length=8)
  ....
  ....
  ....
  'ÿ' => string '&yuml;' (length=6)
  '"' => string '&quot;' (length=6)
  '<' => string '&lt;' (length=4)
  '>' => string '&gt;' (length=4)
  '&' => string '&amp;' (length=5)

Now, remove the correspondences you don't want:

unset($list['"']);
unset($list['<']);
unset($list['>']);
unset($list['&']);

Your list now has all the character => entity correspondences used by htmlentities, except the few characters you don't want to encode.

And now, you just have to extract the list of keys and values :

$search = array_keys($list);
$values = array_values($list);

And, finally, you can use str_replace to do the replacement :

$str_in = '<p><font style="color:#FF0000">Camión español</font></p>';
$str_out = str_replace($search, $values, $str_in);
var_dump($str_out);

And you get :

string '<p><font style="color:#FF0000">Cami&Atilde;&sup3;n espa&Atilde;&plusmn;ol</font></p>' (length=84)

Which looks like what you wanted ;-)


Edit: well, except for the encoding problem (damn UTF-8, I suppose -- I'm trying to find a solution for that, and will edit again)

Second edit, a couple of minutes later: it seems you'll have to use utf8_encode on the $search list before calling str_replace :-(

Which means using something like this :

$search = array_map('utf8_encode', $search);

Between the call to array_keys and the call to str_replace.

And, this time, you should really get what you wanted :

string '<p><font style="color:#FF0000">Cami&oacute;n espa&ntilde;ol</font></p>' (length=70)


And here is the full portion of code :

$list = get_html_translation_table(HTML_ENTITIES);
unset($list['"']);
unset($list['<']);
unset($list['>']);
unset($list['&']);

$search = array_keys($list);
$values = array_values($list);
$search = array_map('utf8_encode', $search);

$str_in = '<p><font style="color:#FF0000">Camión español</font></p>';
$str_out = str_replace($search, $values, $str_in);
var_dump($str_in, $str_out);

And the full output :

string '<p><font style="color:#FF0000">Camión español</font></p>' (length=58)
string '<p><font style="color:#FF0000">Cami&oacute;n espa&ntilde;ol</font></p>' (length=70)

This time, it should be ok ^^
It doesn't really fit in one line, and it might not be the most optimized solution; but it should work fine, and it has the advantage of letting you add or remove any character => entity correspondence as needed.

Have fun !
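
For current PHP versions, here is a minimal sketch of the same idea, assuming UTF-8 input. Since PHP 5.4 the translation table comes back with UTF-8 keys by default, so the utf8_encode step is dropped (utf8_encode is also deprecated as of PHP 8.2). The function name encodeNonMarkupChars is made up for this example, and the flags and encoding are passed explicitly rather than relying on defaults:

```php
<?php
// Hypothetical helper name, not part of PHP or of the answer above.
function encodeNonMarkupChars(string $html): string
{
    // Ask for the entity table with UTF-8 keys. ENT_NOQUOTES keeps
    // both kinds of quotes out of the table.
    $table = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES | ENT_HTML401, 'UTF-8');

    // Drop the characters that make up the markup itself.
    unset($table['<'], $table['>'], $table['&']);

    // strtr with an array replaces each key with its value in one pass.
    return strtr($html, $table);
}

echo encodeNonMarkupChars('<p><font style="color:#FF0000">Camión español</font></p>'), "\n";
// prints <p><font style="color:#FF0000">Cami&oacute;n espa&ntilde;ol</font></p>
```

Like the original, this leaves every <, > and & untouched, including ones that appear in the text rather than in tags.
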

Mchail answered 1/9, 2009 at 22:29 Comment(4)
+1 for the utf-8 part. was using strtr first and that broke the encoding. - Muffle
and this is called awesomeness!! - Footton
it does not need to be this elaborate. see my answer further down. htmlspecialchars_decode( htmlentities( html_entity_decode( $string ) ) ); - Bula
This stopped working in PHP 5.4 because get_html_translation_table now returns UTF-8 by default. You can specify a different encoding if you want, but just removing the utf8_encode from this answer makes it work again. - Nuncupative
A
18

Might not be terribly efficient, but it works

$sample = '<p><font style="color:#FF0000">Camión español</font></p>';

echo htmlspecialchars_decode(
    htmlentities($sample, ENT_NOQUOTES, 'UTF-8', false)
  , ENT_NOQUOTES
);
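
To make the two steps easier to follow, here is the same call split up, as a rough sketch. The wrapper name entitiesKeepingMarkup is invented for this example; the calls inside are the ones used above.

```php
<?php
// Hypothetical wrapper name, used only for this illustration.
function entitiesKeepingMarkup(string $html): string
{
    // Step 1: encode everything, including <, > and &, but leave
    // entities that are already present alone (double_encode = false).
    $encoded = htmlentities($html, ENT_NOQUOTES, 'UTF-8', false);

    // Step 2: turn &lt; &gt; and &amp; back into <, > and &,
    // which restores the tags.
    return htmlspecialchars_decode($encoded, ENT_NOQUOTES);
}

echo entitiesKeepingMarkup('<p>Camión español</p>'), "\n";
// prints <p>Cami&oacute;n espa&ntilde;ol</p>

echo entitiesKeepingMarkup('<p>1 &lt; 2</p>'), "\n";
// prints <p>1 < 2</p>
```

The second call shows a side effect to keep in mind: step 2 cannot tell a &lt; that was already in the input from one produced by step 1, so an escaped < in the text comes back as a literal <.
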
Abound answered 1/9, 2009 at 22:28 Comment(0)
D
7

This is an optimized version of the accepted answer.

$list = get_html_translation_table(HTML_ENTITIES);
unset($list['"']);
unset($list['<']);
unset($list['>']);
unset($list['&']);

$string = strtr($string, $list);
Damondamour answered 23/6, 2010 at 16:30 Comment(1)
Even more optimized: $list = get_html_translation_table(HTML_ENTITIES); unset($list['"'], $list['<'], $list['>'], $list['&']); echo strtr($val, $list); - Vc
T
5

No solution short of a parser is going to be correct for all cases. Yours is a good case:

<p><font style="color:#FF0000">Camión español</font></p>

but do you also want to support:

<p><font>true if 5 < a && name == "joe"</font></p>

where you want it to come out as:

<p><font>true if 5 &lt; a &amp;&amp; name == &quot;joe&quot;</font></p>

Question: can you do the encoding BEFORE you build the HTML? In other words, can you do something like:

"<p><font>" . htmlentities($inner) . "</font></p>"

You'll save yourself lots of grief if you can do that. If you can't, you'll need some way to skip encoding <, >, and " (as described above), or simply encode it all, and then undo it (eg. replace('&lt;', '<'))
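
A minimal sketch of the encode-first approach, assuming the text is available on its own before it is wrapped in markup. The function name wrapInParagraph is made up for this example:

```php
<?php
// Hypothetical helper: escapes the text first, then adds the tags around it.
function wrapInParagraph(string $text): string
{
    // Only the text goes through htmlentities, so the tags added
    // afterwards are never touched.
    return '<p><font>' . htmlentities($text, ENT_QUOTES, 'UTF-8') . '</font></p>';
}

echo wrapInParagraph('true if 5 < a && name == "joe"'), "\n";
// prints <p><font>true if 5 &lt; a &amp;&amp; name == &quot;joe&quot;</font></p>
```

Because the escaping happens before any markup exists, there is nothing to skip or undo afterwards.
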

Taft answered 2/9, 2009 at 4:54 Comment(0)
B
5

One-line solution with NO translation table or custom function required:

I know this is an old question, but I recently had to import a static site into a WordPress site and had to overcome this issue.

Here is my solution, which does not require fiddling with translation tables:

htmlspecialchars_decode( htmlentities( html_entity_decode( $string ) ) );

when applied to the OP's string:

<p><font style="color:#FF0000">Camión español</font></p>

output:

<p><font style="color:#FF0000">Cami&oacute;n espa&ntilde;ol</font></p>

when applied to Luca's string:

<b>Is 1 < 4?</b>è<br><i>"then"</i> <div style="some:style;"><p>gain some <strong>€</strong><img src="/some/path" /></p></div>

output:

<b>Is 1 < 4?</b>&egrave;<br><i>"then"</i> <div style="some:style;"><p>gain some <strong>&euro;</strong><img src="/some/path" /></p></div>
Bula answered 4/1, 2017 at 7:35 Comment(3)
I found this simple line to be working fine indeed. It's often good to scroll down to the more recent answers instead of chosen answers. - Marja
This should only be htmlspecialchars_decode( htmlentities( $string ) ); - if you don't remove that third function call, the input string <p>The HTML you want is "1 &gt; 0"</p> becomes <p>The HTML you want is "1 > 0"</p>, which is incorrect and could be a security hole. - Premedical
@M Somerville -- incorrect... please read the OPs needs completely first. the point is to take a string that may have entities already encoded and turn it into a string with HTML entities where they should be WHILE maintaining HTML markup -- thus you MUST have html_entity_decode() -- can you explain the security risk to converting HTML? there is no mention of a public-facing form submission, for example... edge cases are not within the scope of this post -- perhaps you can start one that addresses what you're pointing out - Bula
H
3

This is a function I've just written which solves this problem in a very elegant way:

First, the HTML tags are extracted from the string; then htmlentities() is run on every remaining substring; finally, the original HTML tags are reinserted at their old positions, so the tags themselves are not altered. :-)

Have fun:

function htmlentitiesOutsideHTMLTags ($htmlText)
{
    $matches = Array();
    $sep = '###HTMLTAG###';

    preg_match_all("@<[^>]*>@", $htmlText, $matches);   
    $tmp = preg_replace("@(<[^>]*>)@", $sep, $htmlText);
    $tmp = explode($sep, $tmp);

    for ($i=0; $i<count($tmp); $i++)
        $tmp[$i] = htmlentities($tmp[$i]);

    $tmp = join($sep, $tmp);

    for ($i=0; $i<count($matches[0]); $i++)
        $tmp = preg_replace("@$sep@", $matches[0][$i], $tmp, 1);

    return $tmp;
}
Hydrokinetics answered 26/2, 2010 at 18:13 Comment(1)
Thanks for sharing your solution! If you don't mind I made some changes to your code, please see my answer. - Lyda
L
2

Based on bflesch's answer, I made some changes to handle strings containing a less-than sign, a greater-than sign, and single or double quotes.

function htmlentitiesOutsideHTMLTags ($htmlText, $ent)
{
    $matches = Array();
    $sep = '###HTMLTAG###';

    preg_match_all(":</{0,1}[a-z]+[^>]*>:i", $htmlText, $matches);

    $tmp = preg_replace(":</{0,1}[a-z]+[^>]*>:i", $sep, $htmlText);
    $tmp = explode($sep, $tmp);

    for ($i=0; $i<count($tmp); $i++)
        $tmp[$i] = htmlentities($tmp[$i], $ent, 'UTF-8', false);

    $tmp = join($sep, $tmp);

    for ($i=0; $i<count($matches[0]); $i++)
        $tmp = preg_replace(":$sep:", $matches[0][$i], $tmp, 1);

    return $tmp;
}



Example of use:

$string = '<b>Is 1 < 4?</b>è<br><i>"then"</i> <div style="some:style;"><p>gain some <strong>€</strong><img src="/some/path" /></p></div>';
$string_entities = htmlentitiesOutsideHTMLTags($string, ENT_QUOTES | ENT_HTML401);
var_dump( $string_entities );

Output is:

string '<b>Is 1 &lt; 4?</b>&egrave;<br><i>&quot;then&quot;</i> <div style="some:style;"><p>gain some <strong>&euro;</strong><img src="/some/path" /></p></div>' (length=150)



You can pass any ent flag according to the htmlentities manual
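
The same extract, encode and reinsert idea can also be sketched without a separator string by letting preg_split keep the tags in its result. This is only an illustration: the function name encodeTextBetweenTags is made up, the tag pattern is a simplified variant of the one above, and UTF-8 input is assumed. One reason to try this form is that preg_replace interprets $1 or \1 inside the replacement string, which could mangle a tag that happens to contain such a sequence.

```php
<?php
// Hypothetical helper name; a sketch, not a full HTML parser.
function encodeTextBetweenTags(string $html, int $flags = ENT_QUOTES | ENT_HTML401): string
{
    // Split on anything that looks like a tag and keep the tags.
    // With PREG_SPLIT_DELIM_CAPTURE the text pieces sit at even
    // indexes and the captured tags at odd indexes.
    $parts = preg_split('~(</?[a-z][^>]*>)~i', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $html; // regex failure: return the input unchanged
    }

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Text between tags: encode, but do not double-encode
            // entities that are already there.
            $parts[$i] = htmlentities($part, $flags, 'UTF-8', false);
        }
    }

    return implode('', $parts);
}

echo encodeTextBetweenTags('<b>Is 1 < 4?</b>è<br><i>"then"</i>'), "\n";
// prints <b>Is 1 &lt; 4?</b>&egrave;<br><i>&quot;then&quot;</i>
```

As with the functions above, anything matching the tag pattern is treated as markup, so unknown or malformed tags are passed through unencoded.
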

Lyda answered 23/4, 2012 at 9:6 Comment(1)
Thanks for getting me closer to a solution, but with the string below your function does not produce exactly what I want - <a href="google.com">google</a> <p>Here is paragraph</p> <aeiou>Invalid tag specified should be displayed as it is</aeiou> <b>Bold it is</b> less than sign - aaaaa<pppppp all special chars = `,./;'[]\~!@#$%^&*()_+{}|:"<>? All <b>characters</strong> - Omnipotent
