How to strip tags in a safer way than using strip_tags function?

Asked 14/2, 2011 at 18:40 Answered 16/6, 2022 at 15:11

I'm having some problems using the PHP strip_tags function when the string contains 'less than' and 'greater than' signs. For example:

If I do:

strip_tags("<span>some text <5ml and then >10ml some text </span>");

I'll get:

some text 10ml some text

But, obviously I want to get:

some text <5ml and then >10ml some text

Yes, I know that I could use &lt; and &gt;, but I don't have the chance to convert those characters into HTML entities since the data is already stored as you can see in my example.

What I'm looking for is a clever way to parse HTML in order to get rid of only the actual HTML tags.

Since TinyMCE was used to generate that data, I know which actual HTML tags could be used in any case, so a strip_tags($string, $black_list) implementation would be more useful than strip_tags($string, $allowable_tags).
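To make the idea concrete, here is a rough sketch of what such a blacklist variant might look like. The function name strip_listed_tags and the tag list are hypothetical, not an existing PHP API, and the regex is only a heuristic: it assumes attribute values never contain ">" and it would still misread text such as "a <b and c > d" if "b" is in the list.

```php
<?php
// Hypothetical helper (not part of PHP): removes only the tags named in $tagNames.
function strip_listed_tags(string $html, array $tagNames): string
{
    if (count($tagNames) === 0) {
        return $html; // nothing to strip
    }
    // Build an alternation such as "span|p|strong", quoting any special characters.
    $names = implode('|', array_map(function ($name) {
        return preg_quote($name, '~');
    }, $tagNames));
    // Matches <name>, <name attr="...">, <name/>, <name /> and </name>, case-insensitively.
    $pattern = '~</?(?:' . $names . ')(?:\s[^>]*)?/?>~i';
    return preg_replace($pattern, '', $html);
}

// Assumed list of tags the editor may have produced.
$tags = ['span', 'p', 'strong', 'em', 'br'];
echo strip_listed_tags("<span>some text <5ml and then >10ml some text </span>", $tags);
```

Because "5ml" is not in the list, the loose "<" and ">" are left alone here.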

Any thoughts?

Tove answered 14/2, 2011 at 18:40 Comment(6)

Why is it obvious what you want to get? <anything is an opening tag, and as such should be removed. So strip_tags is doing what you're asking it to... – Alister 14/2, 2011 at 18:43

I agree with ircmaxell. Your sentence has three tags, like it or not. You will probably need a different approach. Is the source data in a consistent format? Any way you can convert the angle brackets to their HTML encoded equivalents before stripping tags? – Kero 14/2, 2011 at 18:53

@Alister and @clifgriffin: I wrote "obviously" because semantically those signs are not part of a tag, they are meaning 'less than five milliliters' and 'greater than 10 milliliters'. – Tove 14/2, 2011 at 19:52

@ircmaxell: I'm not saying that strip_tags has a bug. I'm asking for the right way to get what I need. – Tove 14/2, 2011 at 19:57

@clifgriffin: I don't have chance to convert those characters into HTML entities since data is already stored as you can see in my example. – Tove 14/2, 2011 at 19:59

@texai: my point was that it is not obvious to a computer what you're asking for. It may feel obvious to either of us, but no programming language will free you from the burden of clarifying your own ideas. That's what I meant from that comment. – Alister 14/2, 2011 at 20:53

As a wacky workaround you could filter non-html brackets with:

$html = preg_replace("# <(?![/a-z]) | (?<=\s)>(?![a-z]) #exi", "htmlentities('$0')", $html);

Apply strip_tags() afterwards. Note that this only works for your specific example and similar cases. It's a regular expression with some heuristics, not artificial intelligence that can discern HTML tags from unescaped angle brackets with other meanings.
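For readability, the same pattern can be laid out with inline comments in extended mode. This is only a sketch: the function name escape_loose_brackets is made up for illustration, and it uses preg_replace_callback rather than the /e modifier, on the assumption that the PHP version in use no longer supports /e.

```php
<?php
// Hypothetical helper: escapes angle brackets that do not look like part of a tag.
function escape_loose_brackets(string $html): string
{
    $pattern = '~
        <(?![/a-z])        # a "<" that is not followed by "/" or a letter
      |
        (?<=\s)>(?![a-z])  # a ">" preceded by whitespace and not followed by a letter
    ~xi';
    return preg_replace_callback($pattern, function ($matches) {
        // Turn the matched bracket into its entity (&lt; or &gt;).
        return htmlentities($matches[0]);
    }, $html);
}

echo escape_loose_brackets("<span>some text <5ml and then >10ml some text </span>");
// <span>some text &lt;5ml and then &gt;10ml some text </span>
```

The heuristic only looks at the characters next to each bracket, so something like "a<b" (a letter right after "<") is still treated as the start of a tag.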

Vinni answered 14/2, 2011 at 18:55 Comment(1)

since you are already using PCRE_EXTENDED you could add inline comments so we can better understand the Regex. – Haskins 14/2, 2011 at 19:12

If you want to have "greater than" and "less than" signs, you need to escape them:

> is &gt;

< is &lt;

See e.g. this: http://www.w3schools.com/html/html_entities.asp

Hankow answered 14/2, 2011 at 18:55 Comment(2)

Yes I know that, but I don't have chance to convert those characters into HTML entities since data is already stored as you can see in my example. What I'm looking for is a clever way to parse HTML in order to strip actual HTML tags – Tove 14/2, 2011 at 20:1

@texai: well, off you go to the land of guesswork and pain, otherwise known as Heuristics ;) @mario's answer looks kind of useful in this regard. – Hankow 14/2, 2011 at 20:2

Instead of strip_tags(), just use htmlspecialchars().

http://php.net/manual/en/function.htmlspecialchars.php
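A quick check of what that does to the string from the question, as a sketch: htmlspecialchars() encodes every angle bracket, including those of the <span> tag, so the markup is kept as visible text rather than removed.

```php
<?php
$input = "<span>some text <5ml and then >10ml some text </span>";

// Encodes &, <, > and quotes; it does not strip anything.
$encoded = htmlspecialchars($input);

echo $encoded, PHP_EOL;
// &lt;span&gt;some text &lt;5ml and then &gt;10ml some text &lt;/span&gt;
```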

Stillhunt answered 14/2, 2011 at 19:17 Comment(2)

This doesn't meet the requirement of replacing "<span>" with "" and "</span>" with "" – Decorous 8/7, 2016 at 9:49

htmlspecialchars() and htmlentities() will only encode the content in the string. This will not remove any tags. – Livorno 7/12, 2016 at 19:20

Following up on the accepted answer that uses a heuristic function to try to remove tags while sparing < and > signs, here is a version that uses preg_replace_callback, as the /e modifier in preg_replace was deprecated in PHP 5.5 and removed in PHP 7.0:

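function HTMLToString($string){
    // Escape "<" not followed by "/" or a letter, and ">" that follows whitespace and precedes a non-letter.
    $escaped = preg_replace_callback(
        "# <(?![/a-z]) | (?<=\s)>(?![a-z]) #xi",
        function ($matches){
            return htmlentities($matches[0]);
        },
        $string
    );
    // Strip the remaining tags, then turn the escaped brackets back into plain characters.
    return htmlspecialchars_decode(strip_tags($escaped));
}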

Arabinose answered 16/6, 2022 at 15:11 Comment(0)
