remove script tag from HTML content

Asked 20/8, 2011 at 9:18 Answered 6/5, 2020 at 14:18

Solved php regex htmlpurifier

I am using HTML Purifier (http://htmlpurifier.org/)

I just want to remove <script> tags only. I don't want to remove inline formatting or any other things.

How can I achieve this?

One more thing: is there any other way to remove script tags from HTML?

Papyraceous answered 20/8, 2011 at 9:18 Comment(13)

Keep in mind that script tags are not the only vulnerable parts of HTML. – Direful 20/8, 2011 at 9:22

Yes, I know about other vulnerable parts too, but I just need to remove script tags – Papyraceous 20/8, 2011 at 9:24

Read this. It will help you – Deberadeberry 20/8, 2011 at 9:28

@Jose hell no. read this #1732848 no regex for parsing html – Mersey 20/8, 2011 at 9:47

This question was already asked many times e.g. here or here, but beware of that. – Torietorii 20/8, 2011 at 10:0

@Rikudo Well... if he needs to use regexp to remove html tags... there should be a reason. Thanks for that link! – Deberadeberry 20/8, 2011 at 10:3

@Jose the reason is not being familiar with other, better tools. It's the exact same reason people are still using mysql_* functions in PHP. – Mersey 20/8, 2011 at 10:6

@Rikudo Sennin -- or PHP at all. :) – Russian 20/8, 2011 at 10:7

@Malvolio nahhh, that's going a bit too far now :P – Mersey 20/8, 2011 at 10:8

@Rikudo Using regex for HTML parsing has its own advantages and disadvantages. Its usefulness depends on the particular situation. Don't be so fanatic. The world is much more complex and the same rule can't be used for all purposes. Yes, in many cases regex is not the best tool for HTML parsing, but this doesn't mean anything. – Direful 20/8, 2011 at 10:16

Obviously. However, in most cases it's very inefficient and insecure to use a regex. It's very problematic to use a parser that does not understand the language it's parsing. That's why there are specific HTML and XML parsers. – Mersey 20/8, 2011 at 10:18

@Rikudo You are trying to use one rule for everything :) Later you'll see that not everything is so simple. – Direful 20/8, 2011 at 10:25

Regarding the html parser vs. regex debate - you probably need both; be aware that an html parser will not recognize conditional comments which means that IE will happily render script tags therein. The general problem with solving this in an elegant way is that the browsers don't care... – Bract 18/1, 2013 at 15:20

164

Because this question is tagged with regex, I'm going to answer with a poor man's solution for this situation:

$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

However, regular expressions are not meant for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases a regex is useful for quickly fixing some markup, but as with all quick fixes, forget about security. Use regex only on content/markup you trust.

Remember, anything a user inputs should be considered unsafe.

A better solution here would be to use DOMDocument, which is designed for this. Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable and (nearly) safe it is to do the same:

<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

$script = $dom->getElementsByTagName('script');

$remove = [];
foreach($script as $item)
{
  $remove[] = $item;
}

foreach ($remove as $item)
{
  $item->parentNode->removeChild($item); 
}

$html = $dom->saveHTML();

I have removed the HTML intentionally because even this can bork.
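
The two loops above are there because getElementsByTagName() returns a live DOMNodeList: removing a node while iterating over the list shifts the remaining items, so a single foreach can skip elements. As a minimal sketch (the sample markup and variable names are made up for illustration), the list can also be drained from the front until it is empty:

```php
<?php
// Illustrative input only: three script elements, one of them empty.
$html = '<html><body><p>keep me</p>'
      . '<script>a()</script><script src="x.js"></script><script>b()</script>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$scripts = $dom->getElementsByTagName('script');
$before  = $scripts->length; // number of script elements found

// The list is live: every removal shrinks it, so always take item(0).
while ($scripts->length > 0) {
    $node = $scripts->item(0);
    $node->parentNode->removeChild($node);
}

$after  = $scripts->length; // what is left in the live list
$result = $dom->saveHTML();
```

Copying the nodes into an array first, as the answer does, and draining the live list are two ways around the same behaviour.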

Monocyclic answered 20/8, 2011 at 10:15 Comment(21)

-1 for RegExp solution. See this discussion. – Dumortierite 20/8, 2011 at 10:20

I saw that discussion long time ago, you should read it, not just see it. – Occlude 20/8, 2011 at 10:23

While I appreciate your aloof response, my reasoning for disapproving your answer is sound. See this gist for a crafted script tag which circumvents your regex. In fairness, it is arguably more of shortcoming of your particular regular expression than a reason to abandon regex altogether. But, interesting to me all the same. – Dumortierite 7/12, 2011 at 23:53

This particular regex is vulnerable to javascript injection. – Rissa 31/3, 2012 at 5:1

@ParijatKalia it's a stupid idea to display remote HTML with or without script anyways, what difference does it makes? If you are absolutely sure about the content, I doubt you'll run into a HTML like you've written. Btw, I answered with regex only because the questions was tagged like so. – Occlude 23/4, 2013 at 19:59

If you want to take the regex route, make sure you run preg_replace multiple times until the output doesn't change anymore (catches example input from @ParijatKalia). – Blackwell 22/8, 2013 at 12:22

Just out of interest, why do you have two foreach loops? Why not just foreach($scripts as $script){$script->parentNode->removeChild($script);}? – Monstrosity 16/12, 2014 at 16:56

@Monstrosity because you will not get correct results (iterator doesn't behave like it's expected), see this comment. – Occlude 16/12, 2014 at 21:20

@webarto Thanks for your reply, particularly the ref! – Monstrosity 17/12, 2014 at 10:55

why is the #is for on the regex? – Katmandu 18/12, 2014 at 18:23

For sake of argument. Sometimes it IS necessary to use regex to strip tags from content. Sure, we all know this is bad but sometimes you HAVE to use regex. The DOMDocument will not work unless it is HTML. But let's say you are importing content from Drupal to WordPress... DOMDocument will not work as this is not true HTML in the content but just text with markup in it. This is when you HAVE to use regex as you want to keep most tags but remove script tags as they shouldn't be there anyways. So sure, use DOMDocument if you can but to say you shouldn't use regex to do this is just ignorant. – Lilla 9/2, 2015 at 19:27

You regexp haters are acting like DOMDocument is safer. It's not. – Whiffen 17/3, 2016 at 3:28

how do you get the DOMDocument parser to not add the Doctype, HTML and BODY tags? – Nathalienathan 17/6, 2016 at 20:2

Thanks for the answer, but I second Mike comment above. If I'm working with an HTML snippet, I wouldn't appreciate to have other stuff added around like saveHTML apparently does. – Likely 3/11, 2016 at 15:24

In the regex solution i think you should escape / in </script as otherwise it will treat the end as modifiers: "ERROR: Unknown modifier 'c'" – Filigree 25/11, 2016 at 9:10

To avoid adding DOCTYPE, html and body tags, see this answer. – Spotter 30/10, 2017 at 7:30

'~<script[^>]*>.*</script\s*>~is' – Remembrance 26/3, 2018 at 14:45

Note that this breaks DOMDocument parsing when using loadHTML() because of the HTML markup in a Javascript string: <div> <script> var str = '</div>this does NOT get removed'; </script> </div> – Guillemot 28/9, 2018 at 18:57

saveHtml() will add extra unnecessary html to the string ie: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "w3.org/TR/REC-html40/loose.dtd"> <html><body><p> for more info see 3v4l.org/1TNHP – Disinfest 2/1, 2020 at 21:20

What about <SCRIPT>alert(123)</SCRIPT> uppercased or mixed tags? – Chayachayote 8/4, 2020 at 23:54

The DOMDocument solution does not work for me, it puts the <p> inside the <h1> tag, thus messing up the whole html – Malikamalin 17/6, 2020 at 11:17

Use the PHP DOMDocument parser.

$doc = new DOMDocument();

// load the HTML string we want to strip
$doc->loadHTML($html);

// get all the script tags
$script_tags = $doc->getElementsByTagName('script');

$length = $script_tags->length;

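// remove each tag from the DOM; the node list is live, so walk it backwards
// to keep the remaining indexes valid while nodes are being removed
for ($i = $length - 1; $i >= 0; $i--) {
  $script_tag = $script_tags->item($i);
  $script_tag->parentNode->removeChild($script_tag);
}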

// get the HTML string back
$no_script_html_string = $doc->saveHTML();

This worked for me using the following HTML document:

<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>
            hey
        </title>
        <script>
            alert("hello");
        </script>
    </head>
    <body>
        hey
    </body>
</html>

Just bear in mind that the DOMDocument parser requires PHP 5 or greater.

Dumortierite answered 20/8, 2011 at 10:3 Comment(10)

+0 I'm sick of hearing about that discussion regarding regex and HTML. In some very special occasions it should be OK to use regex. In my case, I'm getting this error: Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag myCustomTag invalid in Entity. Tried everything. All I want to do is remove script tags for one tiny part of the application (without spending any more time on it). I'm going to use preg_replace and that is that. I don't wanna hear another word about it. :) – Spake 6/12, 2011 at 21:39

See my comment to the chosen best answer. I would prefer to see coders cover general cases, as malicious users can get very clever. However, you are right: in developing an internal application, for instance, it could be considered OK to ignore such vulnerabilities and use regex. – Dumortierite 7/12, 2011 at 23:55

@Xeoncross Thanks! I'll give that a try next time I get a chance to work on this. At the moment I'm busy with other code and don't wanna have to dig that stuff up :). – Spake 10/2, 2012 at 3:22

DOMDocument and SimpleXML can be used to load files outside of your document root. Use libxml_disable_entity_loader(true) to disable this feature of libxml. php.net/manual/en/function.libxml-disable-entity-loader.php – Orourke 19/7, 2012 at 20:19

this code will give 'Fatal error: Call to a member function removeChild() on null' once you have an empty tag, like <script src="..."></script> – Groveman 1/7, 2015 at 13:36

@Spi Interesting. Do you know how to amend the code to fix that? – Dumortierite 2/7, 2015 at 11:30

@SPi I kept getting the same errors. This worked for me (still, used yours as a base, so thanks...):

// load HTML
$dom = new DOMDocument;
$dom->loadHTML($html_to_parse);

// remove all scripts
while (true) {
  $script = $dom->getElementsByTagName('script')->item(0);
  if ($script != NULL) {
    $script->parentNode->removeChild($script);
  }
  else {
    break;
  }
}

– Gillispie 7/9, 2016 at 14:23

Thanks for the update @MatthewKolb. Shame it doesn't work anymore (what PHP version are you using?); do you know if there's something more appropriate? – Dumortierite 30/9, 2018 at 9:9

@Dumortierite I'm using php 5.6.35. Your example still works great - as long as the JS does not include HTML tags. I've read loadXML() would better be able to handle this type of case, but it appears it just fails to load the DOM at all since it considers the input to be invalid XML. I haven't found a better solution than to use REGEX to strip scripts before loading into DOMDocument – Guillemot 1/10, 2018 at 14:17

$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags_to_remove = array('script','style','iframe','link');
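foreach($tags_to_remove as $tag){
    // copy the live node list into an array first, so that removing
    // nodes does not make the loop skip elements
    $elements = iterator_to_array($dom->getElementsByTagName($tag));
    foreach($elements as $item){
        $item->parentNode->removeChild($item);
    }
}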
$html = $dom->saveHTML();

Thanatos answered 24/1, 2018 at 7:59 Comment(2)

I upvoted this response because for one thing it's clean and simple, and it also reminded me that iframes could also cause me trouble. – Culberson 6/12, 2018 at 14:44

Also, I just realized, this adds doctype, html and body tags, which is okay for the current question, but was not okay for me, but I only had to change one line (as the top comment says on the saveHTML php.net page): $dom->loadHTML($html,LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD); – Culberson 6/12, 2018 at 14:59

Shorter:

$html = preg_replace("/<script.*?\/script>/s", "", $html);

When using regex, things might go wrong, so it's safer to do it like this:

$html = preg_replace("/<script.*?\/script>/s", "", $html) ? : $html;

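That way, when the "accident" happens (preg_replace returns null on error), we get the original $html instead of an empty string. Note that ?: also falls back to the original $html when the result is an empty string.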
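
preg_replace() returns null when the regex engine fails, for example when a backtrack limit is hit. The following is a hedged sketch, not part of the answer: the function name strip_script_blocks and the pattern are assumptions for illustration, and the type declarations need PHP 7 or later. It repeats the replacement until the string stops changing, as one of the comments above suggests, and it throws instead of silently returning unsanitized input. It is still a regex and should only be used on markup you trust.

```php
<?php
// Hypothetical helper, not part of any library.
function strip_script_blocks(string $html): string
{
    // Case-insensitive (i), dot also matches newlines (s); tolerates
    // attributes on the opening tag and whitespace before the closing bracket.
    $pattern = '#<script\b[^>]*>.*?</script\s*>#is';

    do {
        $previous = $html;
        $html = preg_replace($pattern, '', $previous);

        if ($html === null) {
            // The regex engine failed; do not hand back unsanitized input.
            throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
        }
    } while ($html !== $previous);

    return $html;
}

// A split tag that only becomes a script block after the first pass.
$clean = strip_script_blocks('a<scr<script>x</script>ipt>alert(1)</script>b');

// Uppercase tags spread over several lines.
$upper = strip_script_blocks("<p>hi</p><SCRIPT>\nalert(123)\n</SCRIPT>");
```

Throwing on failure is a design choice of this sketch; falling back to the original string, as the answer does, keeps the page working but lets unfiltered markup through.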

Sixpenny answered 25/3, 2015 at 7:43 Comment(0)

A simple way is plain string manipulation.

function stripStr($str, $ini, $fin)
{
    while (($pos = mb_stripos($str, $ini)) !== false) {
        $aux = mb_substr($str, $pos + mb_strlen($ini));
        $str = mb_substr($str, 0, $pos);
        
        if (($pos2 = mb_stripos($aux, $fin)) !== false) {
            $str .= mb_substr($aux, $pos2 + mb_strlen($fin));
        }
    }

    return $str;
}

Nw answered 31/10, 2018 at 12:20 Comment(3)

@Someone_who_likes_SE Yes, sure. You can use stripos and substr instead of mb_stripos and mb_substr, but I prefer to use MB functions, they are more reliable. – Mireielle 22/7, 2021 at 23:59

This is all fine, but there is a serious flaw here. Mind you, you do not know which input you have. If $fin not in $str (or $aux), you have a perfect loop here. Happy debugging! There are several options to tweak this code to cope for that flaw. I'll leave it to you to fix it. – Bonnette 16/8, 2021 at 19:29

@Bonnette I have modified it, now if $fin is not found, it cuts from $ini to the end of the string. Regards! – Mireielle 18/8, 2021 at 8:46

Try this complete and flexible solution. It works perfectly, is based in part on some previous answers, but contains additional validation checks and gets rid of the extra implied HTML from the loadHTML(...) function. It is divided into two separate functions (the second depends on the first, so don't reorder them), so you can use it to remove several kinds of HTML tags at once, not just 'script' tags. For example, the removeAllInstancesOfTag(...) function accepts an array of tag names, or optionally just one as a string. So, without further ado, here is the code:


/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [BEGIN] */

/* Usage Example: $scriptless_html = removeAllInstancesOfTag($html, 'script'); */

if (!function_exists('removeAllInstancesOfTag'))
    {
        function removeAllInstancesOfTag($html, $tag_nm)
            {
                if (!empty($html))
                    {
                        $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'); /* For UTF-8 Compatibility. */
                        $doc = new DOMDocument();
                        $doc->loadHTML($html,LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD|LIBXML_NOWARNING);

                        if (!empty($tag_nm))
                            {
                                if (is_array($tag_nm))
                                    {
                                        $tag_nms = $tag_nm;
                                        unset($tag_nm);

                                        foreach ($tag_nms as $tag_nm)
                                            {
                                                $rmvbl_itms = $doc->getElementsByTagName(strval($tag_nm));
                                                $rmvbl_itms_arr = [];

                                                foreach ($rmvbl_itms as $itm)
                                                    {
                                                        $rmvbl_itms_arr[] = $itm;
                                                    }

                                                foreach ($rmvbl_itms_arr as $itm)
                                                    {
                                                        $itm->parentNode->removeChild($itm);
                                                    }
                                            }
                                    }
                                else if (is_string($tag_nm))
                                    {
                                        $rmvbl_itms = $doc->getElementsByTagName($tag_nm);
                                        $rmvbl_itms_arr = [];

                                        foreach ($rmvbl_itms as $itm)
                                            {
                                                $rmvbl_itms_arr[] = $itm;
                                            }

                                        foreach ($rmvbl_itms_arr as $itm)
                                            {
                                                $itm->parentNode->removeChild($itm); 
                                            }
                                    }
                            }

                        return $doc->saveHTML();
                    }
                else
                    {
                        return '';
                    }
            }
    }

/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [END] */

/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [BEGIN] */

/* Prerequisites: 'removeAllInstancesOfTag(...)' */

if (!function_exists('removeAllScriptTags'))
    {
        function removeAllScriptTags($html)
            {
                return removeAllInstancesOfTag($html, 'script');
            }
    }

/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [END] */

And here is a test usage example:


$html = 'This is a JavaScript retention test.<br><br><span id="chk_frst_scrpt">Congratulations! The first \'script\' tag was successfully removed!</span><br><br><span id="chk_secd_scrpt">Congratulations! The second \'script\' tag was successfully removed!</span><script>document.getElementById("chk_frst_scrpt").innerHTML = "Oops! The first \'script\' tag was NOT removed!";</script><script>document.getElementById("chk_secd_scrpt").innerHTML = "Oops! The second \'script\' tag was NOT removed!";</script>';
echo removeAllScriptTags($html);

I hope my answer really helps someone. Enjoy!
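
One caveat about the code above: passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated as of PHP 8.2. A possible replacement, sketched here under the assumption that the input is valid UTF-8 and that PHP 7 or later with mbstring is available, is mb_encode_numericentity(), which turns every non-ASCII character into a numeric entity before the string is handed to loadHTML(). The helper name utf8_to_entities is made up for illustration.

```php
<?php
// Hypothetical helper: encode all non-ASCII code points as numeric entities.
function utf8_to_entities(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

// "\u{00E9}" is the letter e with an acute accent, written encoding-independently.
$encoded = utf8_to_entities("<p>caf\u{00E9}</p>");

$doc = new DOMDocument();
$doc->loadHTML($encoded);

// DOM strings come back as UTF-8, so the accented letter is restored here.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
```

ASCII characters, including the angle brackets of the markup itself, are left untouched by this mapping.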

Analphabetic answered 6/5, 2020 at 14:18 Comment(0)

This is a merge of both ClandestineCoder's and Binh WPO's answers.

The problem with the script tag's angle brackets is that they can appear in more than one variant,

e.g. (< = &#60; = &lt;) & (> = &#62; = &gt;)

So instead of creating a pattern array with a bazillion variants, IMHO a better solution would be:

return preg_replace('/script.*?\/script/ius', '', $text)
       ? preg_replace('/script.*?\/script/ius', '', $text)
       : $text;

This will remove anything that looks like script.../script regardless of the bracket encoding/variant, and you can test it here: https://regex101.com/r/lK6vS8/1

Comedo answered 31/7, 2016 at 22:1 Comment(0)

function remove_script_tags($html){
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $script = $dom->getElementsByTagName('script');

    $remove = [];
    foreach($script as $item){
        $remove[] = $item;
    }

    foreach ($remove as $item){
        $item->parentNode->removeChild($item);
    }

    $html = $dom->saveHTML();
    $html = preg_replace('/<!DOCTYPE.*?<html>.*?<body><p>/ims', '', $html);
    $html = str_replace('</p></body></html>', '', $html);
    return $html;
}

Dejan's answer was good, but saveHTML() adds unnecessary doctype and body tags; this should get rid of them. See https://3v4l.org/82FNP
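
An alternative to stripping the doctype and wrapper tags with a regex afterwards is to wrap the fragment in a known container and serialize only that container's children. This is a sketch under assumptions: the function name, the wrapper id, and the use of libxml_use_internal_errors() to silence parser warnings are illustrative choices, and the input is assumed to be ASCII or already entity-encoded.

```php
<?php
// Hypothetical helper, not taken from the answer above.
function remove_scripts_from_fragment(string $fragment): string
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<div id="fragment-root">' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list before removing anything from it.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    // The wrapper is the first div in document order.
    $root = $dom->getElementsByTagName('div')->item(0);

    // Serialize only the children, which leaves out doctype, html and body.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }

    return $out;
}

$clean = remove_scripts_from_fragment('<p>Hello</p><script>alert(1)</script><p>World</p>');
```

Because only the wrapper's children are serialized, no <p> is wrapped around bare text the way the str_replace() above assumes.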

Disinfest answered 2/1, 2020 at 21:35 Comment(2)

No, it's loadHTML(...) function that adds that. See LIBXML_HTML_NODEFDTD and LIBXML_HTML_NOIMPLIED here: php.net/manual/en/libxml.constants.php – Analphabetic 8/5, 2020 at 14:1

ok thanks James for the clarification! – Disinfest 12/3, 2022 at 18:17

An example modifying ctf0's answer. This should only run preg_replace once, but it also checks for errors and blocks the character codes for a forward slash.

$str = '<script> var a - 1; <&#47;script>'; 

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
$replace = preg_replace($pattern, '', $str); 
return ($replace !== null)? $replace : $str;

If you are using PHP 7, you can use the null coalescing operator to simplify it even more.

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius'; 
return (preg_replace($pattern, '', $str) ?? $str);

Compartmentalize answered 22/3, 2017 at 19:5 Comment(1)

This does have one downside: if someone uses files from a script folder in the HTML, like <img src="/script/email/img.jpg">.. <img src="/script/email/img-0.jpg">, the pattern will match and delete everything in between them. – Compartmentalize 24/3, 2017 at 18:5

I would use BeautifulSoup if it's available. Makes this sort of thing very easy.

Don't try to do it with regexps. That way lies madness.

Russian answered 20/8, 2011 at 10:6 Comment(4)

Why not use regex for this simple operation? – Occlude 20/8, 2011 at 10:11

@webarto See this discussion – Dumortierite 20/8, 2011 at 10:20

@Alex, I know that, but why not use it here? – Occlude 20/8, 2011 at 10:27

Because of the answer I've linked to. It's not safe or any sort of guarantee. An HTML/XML parser is a far better solution. – Dumortierite 20/8, 2011 at 10:32

I had been struggling with this question. I discovered you only really need one function. explode('>', $html); The single common denominator to any tag is < and >. Then after that it's usually quotation marks ( " ). You can extract information so easily once you find the common denominator. This is what I came up with:

$html = file_get_contents('http://some_page.html');

$h = explode('>', $html);

$start = null; // index of the chunk that opens the current script tag

foreach ($h as $k => $v) {
    $v = trim($v); // clean it up a bit

    if (preg_match('/^<script/iu', $v)) {
        // opening tag found: remember where the script starts
        $start = $k;
    } elseif ($start !== null && preg_match('/<\/script$/iu', $v)) {
        // closing tag found: backtrack and mark everything in between
        for ($i = $start; $i <= $k; $i++) {
            $h[$i] = null;
        }
        $start = null;
    }
}

// drop only the marked chunks so that implode() rebuilds the rest unchanged
$ht = array_filter($h, function ($chunk) {
    return $chunk !== null;
});
$html = implode('>', $ht); // all scripts stripped


echo $html;

I see this really only working for script tags because you will never have nested script tags. Of course, you can easily add more code that does the same check and gather nested tags.

I call it accordion coding. implode();explode(); are the easiest ways to get your logic flowing if you have a common denominator.

Detoxify answered 15/4, 2013 at 4:27 Comment(1)

You should not use Regex for finding script tags in the HTML code. Use DOMDocument to parse the entire document and find the script tags to remove – Schrader 2/9, 2021 at 8:27

This is a simplified variant of Dejan Marjanovic's answer:

function removeTags($html, $tag) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $item) {
        $item->parentNode->removeChild($item);
    }
    return $dom->saveHTML();
}

Can be used to remove any kind of tag, including <script>:

$scriptlessHtml = removeTags($html, 'script');

Bolden answered 21/1, 2018 at 0:1 Comment(0)

Use the str_replace function to replace them with an empty string or something similar.

$query = '<script>console.log("I should be banned")</script>';

$badChar = array('<script>','</script>');
$query = str_replace($badChar, '', $query);

echo $query; 
//this echoes console.log("I should be banned")

Wallen answered 29/6, 2018 at 19:58 Comment(3)

I don't know why people keep arguing over DOMDocument and some kind of regex as the "solution" vs "not the solution". I like this guy's answer -- to simply use php's str_replace (but I'd use str_ireplace due to case-insensitivity). Unless you have a ton of stuff you want to remove, this seems to be the simplest and most effective solution. I tell my users that can't paste or type that kind of stuff. If they do, then tough luck -- it will be removed. – Minta 27/10, 2018 at 20:16

This solution keeps the JavaScript code inside the HTML string. This is a joke, not a good solution! However, you can go further and remove everything from "<script" to "</script>". That could be a nice solution. – Mireielle 31/10, 2018 at 12:6

ireplace "<SCRIPT" with "<!--" and "</SCRIPT>" with "--!>" would be better – Analisaanalise 28/5, 2021 at 22:5
