How do I use htmlspecialchars but allow only specific HTML tags to pass through without being converted?
C

6

17

Here is the line of code I have which works great:

$content = htmlspecialchars($_POST['content'], ENT_QUOTES);

But what I would like to do is allow only certain HTML tags to pass through without being converted. Here is the list of tags that I would like to let through:

<pre> </pre>
<b> </b>
<em> </em>
<u> </u>
<ul> </ul>
<li> </li>
<ol> </ol>

I would also like to be able to add more tags later as I think of them. Could someone help me modify the code above so that the tags listed above pass through without being converted?

Cusack answered 10/10, 2012 at 12:51 Comment(2)
htmlspecialchars doesn't look at HTML; it looks at the characters <, >, etc. and escapes them. So you cannot do it with htmlspecialchars alone... maybe HTML Purifier? - Runway
You cannot. But you could convert a constrained whitelist of tags back afterwards, &lt;em&gt; to <em> for example. - Peculation
R
14

I suppose you could do it after the fact:

// $str is the result of htmlspecialchars()
$str = preg_replace('#&lt;(/?(?:pre|b|em|u|ul|li|ol))&gt;#', '<\1>', $str);

It allows the encoded version of <xx> and </xx> where xx is in a controlled set of allowed tags.
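
A minimal sketch of the same escape-then-restore idea, with the allowed tags kept in an array so that adding a tag later means editing one list. The function name `escapeExceptTags` and its parameters are hypothetical and not part of the answer above.

```php
<?php
// Hypothetical helper: escape everything, then restore bare whitelisted tags.
function escapeExceptTags(string $input, array $allowedTags): string
{
    // Escape all special characters first, as in the question.
    $escaped = htmlspecialchars($input, ENT_QUOTES);

    // Build an alternation such as "pre|b|em"; preg_quote guards against
    // regex metacharacters in a tag name.
    $names = implode('|', array_map(
        fn (string $tag): string => preg_quote($tag, '#'),
        $allowedTags
    ));

    // Restore only bare opening and closing tags; a tag that carries
    // attributes does not match and therefore stays escaped.
    return preg_replace('#&lt;(/?(?:' . $names . '))&gt;#', '<$1>', $escaped);
}

$allowed = ['pre', 'b', 'em', 'u', 'ul', 'li', 'ol'];
echo escapeExceptTags('<b>bold</b> <script>alert(1)</script>', $allowed);
```

Because the pattern requires `&gt;` immediately after the tag name, something like `<b onclick="...">` remains escaped text, which is probably the behaviour you want for untrusted input.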

Refined answered 10/10, 2012 at 12:57 Comment(5)
If you also want to allow things like <u class="...">, <img src="..."> or <a href="...">, you just need to change the pattern string to '#&lt;(/?(?:pre|b|em|u|ul|li|ol)(?:.*?)?)&gt;#' - Libb
@Libb What if you wanted the href to contain a certain string, so that not just any site could be linked? - Uel
@Uel For that kind of control it would be advisable to use a proper HTML processor. - Oscoumbrian
It's okay, I managed to get it to work: I check whether the URL contains a very specific string, and only then is the href allowed. - Uel
@Uel You have to be careful with that. Say you are looking for example.com: what's to stop someone from submitting google.com?example.com? It would still link to google.com but, depending on how you look for example.com, it would probably still pass. The only way to do it is to make sure the URL starts with https://example.com - Autotrophic
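
One way to make the check from the last comment stricter is to compare the parsed scheme and host instead of searching for a substring or prefix. This is only a sketch: `isAllowedUrl` is a hypothetical helper, and `parse_url` is a parser rather than a validator, so treat it as a first filter and not as a complete defence.

```php
<?php
// Hypothetical helper: accept only https URLs whose host is exactly the allowed host.
function isAllowedUrl(string $url, string $allowedHost): bool
{
    $parts = parse_url($url);

    // parse_url returns false on seriously malformed input; relative URLs
    // have no scheme or host and are rejected here as well.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }

    return strtolower($parts['scheme']) === 'https'
        && strtolower($parts['host']) === strtolower($allowedHost);
}

var_dump(isAllowedUrl('https://example.com/page', 'example.com'));       // bool(true)
var_dump(isAllowedUrl('https://google.com?example.com', 'example.com')); // bool(false)
var_dump(isAllowedUrl('https://example.com.evil.test/', 'example.com')); // bool(false)
```

Comparing the host exactly also rejects `https://example.com.evil.test/`, which a plain "starts with https://example.com" test would accept.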
A
6

Or you can go with old style:

$content = htmlspecialchars($_POST['content'], ENT_QUOTES);

$turned = array( '&lt;pre&gt;', '&lt;/pre&gt;', '&lt;b&gt;', '&lt;/b&gt;', '&lt;em&gt;', '&lt;/em&gt;', '&lt;u&gt;', '&lt;/u&gt;', '&lt;ul&gt;', '&lt;/ul&gt;', '&lt;li&gt;', '&lt;/li&gt;', '&lt;ol&gt;', '&lt;/ol&gt;' );
$turn_back = array( '<pre>', '</pre>', '<b>', '</b>', '<em>', '</em>', '<u>', '</u>', '<ul>', '</ul>', '<li>', '</li>', '<ol>', '</ol>' );

$content = str_replace( $turned, $turn_back, $content );
Apocrypha answered 10/10, 2012 at 12:59 Comment(2)
I think this (for a small number of tags and content) could actually be much faster :) - Glucose
@Glucose Very much agreed! - Swarts
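
If the two hand-written arrays become tedious to keep in sync, they can be generated from a single list of tag names. A small sketch under that assumption; `restoreTags` is a hypothetical name, not something from the answer above.

```php
<?php
// Hypothetical helper: build the search/replace pairs from one list of tag names.
function restoreTags(string $escaped, array $tags): string
{
    $search = [];
    $replace = [];

    foreach ($tags as $tag) {
        foreach (["<$tag>", "</$tag>"] as $html) {
            $search[] = htmlspecialchars($html, ENT_QUOTES); // e.g. "&lt;pre&gt;"
            $replace[] = $html;                              // e.g. "<pre>"
        }
    }

    return str_replace($search, $replace, $escaped);
}

$content = htmlspecialchars('<em>hi</em> <i>there</i>', ENT_QUOTES);
echo restoreTags($content, ['pre', 'b', 'em', 'u', 'ul', 'li', 'ol']);
```

Adding a tag later then only requires appending its name to the list passed as the second argument.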
V
2

I improved on the way Jack attacks this issue. I added support for <br>, <br/> and anchor tags. The code first replaces href=&quot;...&quot; so that only this attribute is allowed.

$str = preg_replace(
    array('#href=&quot;(.*)&quot;#', '#&lt;(/?(?:pre|a|b|br|em|u|ul|li|ol)(\shref=".*")?/?)&gt;#' ), 
    array( 'href="\1"', '<\1>' ), 
    $str
);
Vaticinal answered 13/2, 2015 at 17:24 Comment(0)
S
1

I made this function to sanitize all HTML special characters except for the HTML tags specified.

It first uses htmlspecialchars() to make the string safe, then it reverts the tags I want to be untouched.

The function supports attribute filtering as an option; keep it enabled if you care about possible XSS attacks, because disabling it lets arbitrary attributes through.

I know regex is not efficient but for moderate string lengths it should be fine. You can check the regex I used here https://regex101.com/r/U6GQse/8

function sanitizeHtml($string, $safeHtmlTags = array('b','i','u','br'), $filterAttributes = true)
{
    $string = htmlspecialchars($string);

    if ($filterAttributes) {
        $replace = "<$1$2$4>";
    } else {
        $replace = "<$1$2$3$4>";
    }
    $string = preg_replace("/&lt;\s*(\/?\s*)(".implode("|", $safeHtmlTags).")(\s?|\s+[\s\S]*?)(\/)?\s*&gt;/", $replace, $string);

    return $string;
}

// Example usage to answer the OP question
$str = "MY HTML CONTENT";
echo sanitizeHtml($str, array('pre','b','em','u','ul','li','ol'));
Sorehead answered 17/12, 2019 at 15:18 Comment(0)
G
0

I liked Elwin's solution, but you probably want to:

  1. Prevent javascript: URLs in the href, or more likely, allow only http(s).
  2. Make the regex quantifiers non-greedy in case there are multiple <a href> tags in the content.

Here is the updated version:

$str = preg_replace(
    array('#href=&quot;(https?://.*?)&quot;#', '#&lt;(/?(?:pre|a|b|br|em|u|ul|li|ol)(\shref=".*?")?/?)&gt;#' ), 
    array( 'href="\1"', '<\1>' ), 
    $str
);
Geometrician answered 11/12, 2019 at 10:17 Comment(0)
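
As a possible variation on the anchor handling, the scheme check can be done in a callback so the URL is inspected after decoding instead of being matched by the pattern alone. This is a sketch, not a replacement for a real HTML sanitizer; `restoreLinks` is a hypothetical name, and it assumes the input was escaped with htmlspecialchars and ENT_QUOTES.

```php
<?php
// Hypothetical helper: restore <a href="..."> only when the URL scheme is http or https.
function restoreLinks(string $escaped): string
{
    // Matches an escaped opening anchor whose only attribute is a double-quoted href.
    $pattern = '#&lt;a\shref=&quot;(.*?)&quot;&gt;#';

    $restored = preg_replace_callback($pattern, function (array $m): string {
        // $m[1] is still entity-encoded; decode it only to inspect the scheme.
        $url = htmlspecialchars_decode($m[1], ENT_QUOTES);
        $scheme = parse_url($url, PHP_URL_SCHEME);

        if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
            return $m[0]; // leave the whole tag escaped
        }

        // Emit the still-encoded URL so quotes inside it cannot end the attribute.
        return '<a href="' . $m[1] . '">';
    }, $escaped);

    // Closing tags carry no attributes, so they can be restored directly.
    return str_replace('&lt;/a&gt;', '</a>', $restored);
}

$in = htmlspecialchars('<a href="https://example.com/">ok</a> <a href="javascript:alert(1)">bad</a>', ENT_QUOTES);
echo restoreLinks($in);
```

Note that a rejected link leaves its closing `</a>` restored without a matching opening tag, so the resulting markup can be unbalanced.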
F
-3

You could use strip_tags

$exceptionString = '<pre><b><em><u><ul><li><ol>';

$content = strip_tags($_POST['content'],$exceptionString );
Ferry answered 10/10, 2012 at 13:2 Comment(1)
That's not exactly what OP is asking for. - Apocrypha
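
To illustrate the difference the comment points at, here is a short comparison of the two behaviours. The sample input is made up for this example.

```php
<?php
$input = '<b onclick="alert(1)">bold</b> <script>alert(2)</script>';

// strip_tags deletes disallowed tags (keeping their text content) and leaves
// attributes on allowed tags untouched.
echo strip_tags($input, '<b>'), "\n";
// <b onclick="alert(1)">bold</b> alert(2)

// Escape-then-restore keeps the disallowed markup visible as plain text
// and only turns bare <b> and </b> back into tags.
echo preg_replace('#&lt;(/?b)&gt;#', '<$1>', htmlspecialchars($input, ENT_QUOTES)), "\n";
```

So strip_tags removes markup instead of escaping it, and it lets the `onclick` attribute on the allowed tag through, which is why it is not a drop-in answer to the question.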
