PHP: strip_tags - remove only certain tags (and their contents)?

Asked 23/6, 2012 at 0:56 Answered 27/6, 2012 at 15:25

I use the strip_tags() function but I need to remove some tags (and all of their contents).

For example:

<div>
  <p class="test">
    Test A
  </p>
  <span>
    Test B
  </span>
  <div>
    Test C
  </div>
</div>

Let's say I need to get rid of the P and SPAN tags and keep only:

<div>
  <div>
    Test C
  </div>
</div>

strip_tags expects as a second parameter the tags that you want to KEEP.

In this particular example I could use strip_tags($html, "<div>"); but the HTML I'm scraping and the tags that need to be removed are different every time.
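For reference, here is a minimal sketch of how the allow-list argument behaves on the sample markup (the `$html` string is just the example above, flattened onto one line). Note that strip_tags() drops only the markup of the disallowed tags and leaves their inner text in place, which is why it cannot remove a tag's contents by itself.

```php
<?php
$html = '<div><p class="test">Test A</p><span>Test B</span><div>Test C</div></div>';

// The second argument is an allow-list: only <div> tags survive.
$result = strip_tags($html, '<div>');

// The <p> and <span> markup is gone, but the text inside them is still there.
echo $result, PHP_EOL; // <div>Test ATest B<div>Test C</div></div>
```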

I searched for hours for a function that suits my needs, but couldn't find anything useful.

Any ideas?

Thorite answered 23/6, 2012 at 0:56 Comment(3)

Start with DOM and XPath – Eellike 23/6, 2012 at 1:2

Question already answered here: #9789121 – Alforja 23/6, 2012 at 1:11

I tried the accepted answer in this post but was not satisfied with the results – Thorite 23/6, 2012 at 13:45

Use a regular expression. Something like this should work:

$tags = array( 'p', 'span');
// \1 is the backreference to the tag name captured by the first group
$text = preg_replace( '#<(' . implode( '|', $tags) . ')>.*?<\/\1>#s', '', $text);

The demo shows it replacing the desired tags with nothing.

Note that you may need to tweak it more, say, to compensate for whitespace within the tags, or other unknowns that your example does not demonstrate.

Here is the regex to use to capture tags with or without attributes:

'#<(' . implode( '|', $tags) . ')(?:[^>]+)?>.*?<\/\1>#s'
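A minimal sketch of how this could be wrapped into a reusable function. The name `strip_tags_with_content` is hypothetical, and the pattern differs slightly from the one above: it adds `\b` so that `p` does not also match `<pre>`, quotes the tag names with preg_quote(), and matches case-insensitively. It is still a regex, so it assumes the removed tags are not nested inside a tag of the same name.

```php
<?php
// Hypothetical helper: removes the listed tags together with their contents.
function strip_tags_with_content(string $html, array $tags): string
{
    if (count($tags) === 0) {
        return $html;
    }

    // Quote each tag name so it is treated literally inside the pattern.
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '#');
    }, $tags));

    // \b     : the tag name must end here ("p" must not match "<pre>")
    // [^>]*  : optional attributes
    // .*?    : lazily consume the contents (the s flag lets . match newlines)
    // \1     : backreference to the tag name captured in group 1
    $pattern = '#<(' . $names . ')\b[^>]*>.*?</\1\s*>#is';

    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '', $html) ?? $html;
}

$html = '<div><p class="test">Test A</p><span>Test B</span><div>Test C</div></div>';
echo strip_tags_with_content($html, array('p', 'span')), PHP_EOL;
// <div><div>Test C</div></div>
```

Because the closing tag is found lazily, something like `<span><span>a</span>b</span>` would stop at the first `</span>` and leave `b</span>` behind.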

Gaither answered 23/6, 2012 at 1:4 Comment(8)

@Downvoter - Any comment as to why my functional answer was downvoted? – Gaither 23/6, 2012 at 1:44

Thanks, this is perfect for my situation. I'm scraping HTML using simple html dom parser and just needed some extra stripping. – Thorite 23/6, 2012 at 13:44

UPDATE: this regex only strips single tags without attributes... the following seems to work : $text = preg_replace( '#<(' . implode( '|', $tags) . ').*>.*?</\1>#s', '', $text); – Thorite 25/6, 2012 at 23:43

@Thorite - Your examples don't include attributes - If you wanted to match them, you should be using: '#<(' . implode( '|', $tags) . ')[^>]+>.*?</\1>#s'. – Gaither 25/6, 2012 at 23:51

@Thorite - Un-accepting this answer on the basis that it didn't fulfill a requirement you never mentioned is ridiculous. It perfectly answers the question you stated, which is proven by the linked demo. – Gaither 25/6, 2012 at 23:54

I agree, I just didn't want to close this one yet, hoping for some more answers. Your regex now matches tags with attributes, but no longer works for tags without attributes. Sadly, my knowledge of regexes is not good enough to fix this. – Thorite 26/6, 2012 at 23:35

update: with '#<(' . implode( '|', $tags) . ')[^>]*?>.*?</\1>#s' it seems to work for tags with & without attributes – Thorite 27/6, 2012 at 0:2

@Thorite - I've updated my answer with a regex to use that will capture tags with or without attributes. – Gaither 27/6, 2012 at 12:38

You say that you are using Simple HTML DOM (Good! That's the right way to parse HTML). When I need to remove a tag and its contents, I do:

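// $html is the simple_html_dom object you already loaded
$rows = $html->find("span");

foreach ($rows as $row)
{
  // Emptying outertext removes the tag and everything inside it
  $row->outertext = "";
}

// Serialize the modified DOM and parse it again
$html->load($html->save());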

The last line is required because the DOM gets confused after modifications are made, so the entire DOM has to be collapsed to a string and parsed again for the changes to become permanent (IMO, a bug in Simple HTML DOM).

The Simple HTML DOM approach is safer and more stable than a regular expression, which can break on nested tags or unusual attribute values.
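If Simple HTML DOM is not available, the "DOM and XPath" route suggested in the comments can be sketched with PHP's bundled DOM extension. This is only a sketch under a few assumptions: the function name `remove_elements` is hypothetical, the tag names are plain lowercase element names, and the input has a single root element (as in the question's example), since LIBXML_HTML_NOIMPLIED behaves oddly with several top-level nodes.

```php
<?php
// Hypothetical helper: removes every element with one of the given tag names,
// including its contents, using DOMDocument and DOMXPath.
function remove_elements(string $html, array $tags): string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Do not add an implied <html>/<body> wrapper or a doctype to the fragment.
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($tags as $tag) {
        // Copy the matches into an array before removing anything.
        foreach (iterator_to_array($xpath->query('//' . $tag)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    return $doc->saveHTML();
}

$html = '<div><p class="test">Test A</p><span>Test B</span><div>Test C</div></div>';
echo remove_elements($html, array('p', 'span'));
```

Unlike the regex, this handles nested tags of the same name, because whole subtrees are removed from the parsed document.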

Langevin answered 27/6, 2012 at 15:25 Comment(0)
