remove script tag from HTML content
P

13

74

I am using HTML Purifier (http://htmlpurifier.org/)

I just want to remove <script> tags only. I don't want to remove inline formatting or any other things.

How can I achieve this?

One more thing: is there any other way to remove script tags from HTML?

Papyraceous answered 20/8, 2011 at 9:18 Comment(13)
Keep in mind that script tags are not the only vulnerable parts of HTML.Direful
Yes, I know about other vulnerable parts too, but I just need to remove script tagsPapyraceous
Read this. It will help youDeberadeberry
@Jose hell no. read this #1732848 no regex for parsing htmlMersey
This question was already asked many times e.g. here or here, but beware of that.Torietorii
@Rikudo Well... if he needs to use regexp to remove html tags... there should be a reason. Thanks for that link!Deberadeberry
@Jose the reason is not being familiar with other, better tools. It's the exact same reason people are still using mysql_* functions in PHP.Mersey
@Rikudo Sennin -- or PHP at all. :)Russian
@Malvolio nahhh, that's going a bit too far now :PMersey
@Rikudo Using regex for HTML parsing has its own advantages and disadvantages. Its usefulness depends on the particular situation. Don't be so fanatic. The world is much more complex and the same rule can't be used for all purposes. Yes, in many cases regex is not the best tool for HTML parsing, but this doesn't mean anything.Direful
Obviously. However, in most cases it's very inefficient and insecure to use a regex. It's very problematic to use a parser that does not understand the language it's parsing. That's why there are specific HTML and XML parsers.Mersey
@Rikudo You are trying to use one rule for everything :) Later you'll see that not everything is so simple.Direful
Regarding the html parser vs. regex debate - you probably need both; be aware that an html parser will not recognize conditional comments which means that IE will happily render script tags therein. The general problem with solving this in an elegant way is that the browsers don't care...Bract
M
164

Because this question is tagged with regex, I'm going to answer with a poor man's solution for this situation:

$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

However, regular expressions are not meant for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases a regex is useful for quickly fixing some markup, but as with any quick fix, forget about security. Use regex only on content/markup you trust.

Remember, anything a user inputs should be considered unsafe.
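
To illustrate why the regex should only be used on trusted markup, here is a minimal sketch with a crafted input that survives a single pass. The helper names strip_scripts_once() and strip_scripts_until_stable() are hypothetical, and the crafted string is just one example of nested tags. Repeating the replacement until the output stops changing handles this particular input, but it should not be read as a guarantee for untrusted HTML.

```php
<?php
// Hypothetical helpers for illustration; not part of the original answer.

// One pass of the regex from the answer; keeps the input if preg_replace fails.
function strip_scripts_once(string $html): string
{
    $out = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);
    return $out === null ? $html : $out;
}

// Repeat the pass until the string no longer changes.
function strip_scripts_until_stable(string $html): string
{
    do {
        $before = $html;
        $html = strip_scripts_once($html);
    } while ($html !== $before);

    return $html;
}

// Removing the inner <script></script> pairs re-assembles a working tag.
$crafted = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>';

echo strip_scripts_once($crafted), "\n";          // a complete script tag remains
echo strip_scripts_until_stable($crafted), "\n";  // nothing remains for this input
```

The loop only helps with inputs that rebuild a complete, closed tag; an unclosed <script> tag, for instance, never matches the pattern at all.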

A better solution here is to use DOMDocument, which is designed for this. Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable and (nearly) safe it is to do the same:

<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

$script = $dom->getElementsByTagName('script');

$remove = [];
foreach($script as $item)
{
  $remove[] = $item;
}

foreach ($remove as $item)
{
  $item->parentNode->removeChild($item); 
}

$html = $dom->saveHTML();

I have removed the HTML intentionally because even this can bork.
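
For HTML fragments, the snippet above returns a full document with an added doctype, html and body. A possible variant is sketched below, using the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags mentioned in the comments of this thread. The function name remove_scripts_from_fragment() and the wrapping <div> are assumptions made for this sketch, and the flags need a reasonably recent PHP/libxml build.

```php
<?php
// Sketch only: remove_scripts_from_fragment() and the <div> wrapper are
// illustrative assumptions, not part of the answer above.
function remove_scripts_from_fragment(string $fragment): string
{
    // Collect libxml parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Wrap the fragment so the parser sees a single root element.
    $dom->loadHTML(
        '<div>' . $fragment . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list before removing anything from it.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }

    return $out;
}

$clean = remove_scripts_from_fragment('<p>Hello <b>world</b></p><script>alert(1)</script>');
echo $clean, "\n";
```

The caveat raised in the comments still applies: markup inside a JavaScript string (such as '</div>') can confuse the parser, so this is a convenience for well-behaved fragments rather than a sanitizer.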

Monocyclic answered 20/8, 2011 at 10:15 Comment(21)
-1 for RegExp solution. See this discussion.Dumortierite
I saw that discussion long time ago, you should read it, not just see it.Occlude
While I appreciate your aloof response, my reasoning for disapproving your answer is sound. See this gist for a crafted script tag which circumvents your regex. In fairness, it is arguably more of shortcoming of your particular regular expression than a reason to abandon regex altogether. But, interesting to me all the same.Dumortierite
This particular regex is vulnerable to javascript injection.Rissa
@ParijatKalia it's a stupid idea to display remote HTML, with or without script, anyway; what difference does it make? If you are absolutely sure about the content, I doubt you'll run into HTML like you've written. Btw, I answered with regex only because the question was tagged like so.Occlude
If you want to take the regex route, make sure you run preg_replace multiple times until the output doesn't change anymore (catches example input from @ParijatKalia).Blackwell
Just out of interest, why do you have two foreach loops? Why not just foreach($scripts as $script){$script->parentNode->removeChild($script);}?Monstrosity
@Monstrosity because you will not get correct results (iterator doesn't behave like it's expected), see this comment.Occlude
@webarto Thanks for your reply, particularly the ref!Monstrosity
What is the #is for on the regex?Katmandu
For sake of argument. Sometimes it IS necessary to use regex to strip tags from content. Sure, we all know this is bad but sometimes you HAVE to use regex. The DOMDocument will not work unless it is HTML. But let's say you are importing content from Drupal to WordPress... DOMDocument will not work as this is not true HTML in the content but just text with markup in it. This is when you HAVE to use regex as you want to keep most tags but remove script tags as they shouldn't be there anyways. So sure, use DOMDocument if you can but to say you shouldn't use regex to do this is just ignorant.Lilla
You regexp haters are acting like DOMDocument is safer. It's not.Whiffen
how do you get the DOMDocument parser to not add the Doctype, HTML and BODY tags?Nathalienathan
Thanks for the answer, but I second Mike comment above. If I'm working with an HTML snippet, I wouldn't appreciate to have other stuff added around like saveHTML apparently does.Likely
In the regex solution i think you should escape / in </script as otherwise it will treat the end as modifiers: "ERROR: Unknown modifier 'c'"Filigree
To avoid adding DOCTYPE, html and body tags, see this answer.Spotter
'~<script[^>]*>.*</script\s*>~is'Remembrance
Note that this breaks DOMDocument parsing when using loadHTML() because of the HTML markup in a Javascript string: <div> <script> var str = '</div>this does NOT get removed'; </script> </div>Guillemot
saveHtml() will add extra unnecessary html to the string ie: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "w3.org/TR/REC-html40/loose.dtd"> <html><body><p> for more info see 3v4l.org/1TNHPDisinfest
What about <SCRIPT>alert(123)</SCRIPT> uppercased or mixed tags?Chayachayote
The DOMDocument solution does not work for me, it puts the <p> inside the <h1> tag, thus messing up the whole htmlMalikamalin
D
44

Use the PHP DOMDocument parser.

$doc = new DOMDocument();

// load the HTML string we want to strip
$doc->loadHTML($html);

// get all the script tags
$script_tags = $doc->getElementsByTagName('script');

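$length = $script_tags->length;

// for each tag, remove it from the DOM; the node list is live and shrinks
// on every removal, so walk it from the end to keep the indexes valid
for ($i = $length - 1; $i >= 0; $i--) {
  $node = $script_tags->item($i);
  $node->parentNode->removeChild($node);
}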

// get the HTML string back
$no_script_html_string = $doc->saveHTML();

This worked for me using the following HTML document:

<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>
            hey
        </title>
        <script>
            alert("hello");
        </script>
    </head>
    <body>
        hey
    </body>
</html>

Just bear in mind that the DOMDocument parser requires PHP 5 or greater.
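
One detail worth knowing when removing nodes: the list returned by getElementsByTagName() is live, so its length changes as nodes are removed. The sketch below, with a made-up three-script document, records the length after each removal and always removes item(0), which is one way to avoid indexing past the end of the shrinking list.

```php
<?php
// Illustrative document; the markup is an assumption made for this sketch.
$doc = new DOMDocument();
$doc->loadHTML(
    '<html><head><script>a()</script></head>'
    . '<body><p>hi</p><script>b()</script><script>c()</script></body></html>'
);

$scripts = $doc->getElementsByTagName('script');

// Track how the live list shrinks while nodes are removed.
$lengths = [$scripts->length];

while ($scripts->length > 0) {
    // Always take the first remaining node; indexes shift after each removal.
    $first = $scripts->item(0);
    $first->parentNode->removeChild($first);
    $lengths[] = $scripts->length;
}

echo implode(' ', $lengths), "\n";
```

A forward loop that caches the original length would ask for item(1) once only one node is left, which is where the "removeChild() on null" error reported in the comments comes from.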

Dumortierite answered 20/8, 2011 at 10:3 Comment(10)
+0 I'm sick of hearing about that discussion regarding regex and HTML. In some very special occasions it should be OK to use regex. In my case, I'm getting this error: Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag myCustomTag invalid in Entity. Tried everything. All I want to do is remove script tags for one tiny part of the application (without spending any more time on it). I'm going to use preg_replace and that is that. I don't wanna hear another word about it. :)Spake
See my comment to the chosen best answer. I would prefer to see coders cover general cases, as malicious users can get very clever. However, you are right: in developing an internal application, for instance, it could be considered OK to ignore such vulnerabilities and use regex.Dumortierite
@Xeoncross Thanks! I'll give that a try next time I get a chance to work on this. At the moment I'm busy with other code and don't wanna have to dig that stuff up :).Spake
DOMDocument and SimpleXML can be used to load files outside of your document root. Use libxml_disable_entity_loader(true) to disable this feature of libxml. php.net/manual/en/function.libxml-disable-entity-loader.phpOrourke
this code will give 'Fatal error: Call to a member function removeChild() on null' once you have an empty tag, like <script src="..."></script>Groveman
@Spi Interesting. Do you know how to amend the code to fix that?Dumortierite
@SPi I kept getting the same errors. This worked for me (still, used yours as a base, so thanks...): // load HTML $dom = new DOMDocument; $dom->loadHTML($html_to_parse); // remove all scripts while (true) { $script = $dom->getElementsByTagName('script')->item(0); if ($script != NULL) { $script->parentNode->removeChild($script); } else { break; } }Gillispie
Note that this breaks DOMDocument parsing when using loadHTML() because of the HTML markup in a Javascript string: <div> <script> var str = '</div>this does NOT get removed'; </script> </div>Guillemot
Thanks for the update @MatthewKolb. Shame it doesn't work anymore (what PHP version are you using?); do you know if there's something more appropriate?Dumortierite
@Dumortierite I'm using php 5.6.35. Your example still works great - as long as the JS does not include HTML tags. I've read loadXML() would better be able to handle this type of case, but it appears it just fails to load the DOM at all since it considers the input to be invalid XML. I haven't found a better solution than to use REGEX to strip scripts before loading into DOMDocumentGuillemot
T
7
$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags_to_remove = array('script','style','iframe','link');
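foreach($tags_to_remove as $tag){
    // getElementsByTagName() returns a live list, so copy it to an array
    // first; removing nodes while iterating the live list skips elements
    $elements = iterator_to_array($dom->getElementsByTagName($tag));
    foreach($elements as $item){
        $item->parentNode->removeChild($item);
    }
}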
$html = $dom->saveHTML();
Thanatos answered 24/1, 2018 at 7:59 Comment(2)
I upvoted this response because for one thing it's clean and simple, and it also reminded me that iframes could also cause me trouble.Culberson
Also, I just realized, this adds doctype, html and body tags, which is okay for the current question, but was not okay for me, but I only had to change one line (as the top comment says on the saveHTML php.net page): $dom->loadHTML($html,LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);Culberson
S
4

Shorter:

$html = preg_replace("/<script.*?\/script>/s", "", $html);

Things can go wrong when running a regex, so it's safer to do it like this:

$html = preg_replace("/<script.*?\/script>/s", "", $html) ? : $html;

That way, when the "accident" happens (preg_replace returns null on error), we get the original $html instead of an empty string.
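
A slightly more explicit version of the same fallback is sketched here. The function name strip_scripts_or_keep() is hypothetical, and the added i flag (case-insensitive matching) is an assumption that goes beyond the pattern above. It checks for null directly rather than relying on the short ternary, which would also treat a legitimately empty result as a failure.

```php
<?php
// Hypothetical wrapper; the name and the extra "i" flag are assumptions.
function strip_scripts_or_keep(string $html): string
{
    $result = preg_replace('/<script.*?\/script>/is', '', $html);

    // preg_replace() returns null when the regex engine reports an error,
    // for example when the backtrack limit is exceeded.
    if ($result === null) {
        return $html;
    }

    return $result;
}

echo strip_scripts_or_keep('a<SCRIPT>alert(1)</script>b'), "\n";
```

Returning the original string on failure keeps the page content but also keeps the scripts, so for untrusted input it may be preferable to reject the content instead.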

Sixpenny answered 25/3, 2015 at 7:43 Comment(0)
N
4

A simple way, using string manipulation.

function stripStr($str, $ini, $fin)
{
    while (($pos = mb_stripos($str, $ini)) !== false) {
        $aux = mb_substr($str, $pos + mb_strlen($ini));
        $str = mb_substr($str, 0, $pos);
        
        if (($pos2 = mb_stripos($aux, $fin)) !== false) {
            $str .= mb_substr($aux, $pos2 + mb_strlen($fin));
        }
    }

    return $str;
}
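
A possible way to call the function for script tags is sketched below. The function body is repeated (behind a function_exists guard) only so the example runs on its own; the delimiters '<script' and '</script>' and the sample strings are assumptions for illustration. The mb_* functions require the mbstring extension.

```php
<?php
// Copy of the function above, repeated so this example is self-contained.
if (!function_exists('stripStr')) {
    function stripStr($str, $ini, $fin)
    {
        while (($pos = mb_stripos($str, $ini)) !== false) {
            $aux = mb_substr($str, $pos + mb_strlen($ini));
            $str = mb_substr($str, 0, $pos);

            if (($pos2 = mb_stripos($aux, $fin)) !== false) {
                $str .= mb_substr($aux, $pos2 + mb_strlen($fin));
            }
        }

        return $str;
    }
}

// Cut everything from each "<script" up to and including the next "</script>".
$mixed = 'a<SCRIPT>x</script>b<script src="y"></script>c';
$stripped = stripStr($mixed, '<script', '</script>');
echo $stripped, "\n";

// With no closing tag, everything from "<script" to the end is dropped.
$unclosed = stripStr('a<script>x', '<script', '</script>');
echo $unclosed, "\n";
```

Passing '<script' without the closing bracket lets the opening delimiter match tags that carry attributes.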
Nw answered 31/10, 2018 at 12:20 Comment(3)
@Someone_who_likes_SE Yes, sure. You can use stripos and substr instead of mb_stripos and mb_substr, but I prefer to use MB functions, they are more reliable.Mireielle
This is all fine, but there is a serious flaw here. Mind you, you do not know which input you have. If $fin not in $str (or $aux), you have a perfect loop here. Happy debugging! There are several options to tweak this code to cope for that flaw. I'll leave it to you to fix it.Bonnette
@Bonnette I have modified it, now if $fin is not found, it cuts from $ini to the end of the string. Regards!Mireielle
A
4

Try this complete and flexible solution. It works perfectly and is based in part on some previous answers, but it adds validation checks and avoids the extra implied HTML that the loadHTML(...) function would otherwise add. It is split into two functions (the second depends on the first, so don't reorder them) so you can remove several kinds of HTML tags at once, not just 'script' tags: removeAllInstancesOfTag(...) accepts either an array of tag names or a single tag name as a string. So, without further ado, here is the code:


/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [BEGIN] */

/* Usage Example: $scriptless_html = removeAllInstancesOfTag($html, 'script'); */

if (!function_exists('removeAllInstancesOfTag'))
    {
        function removeAllInstancesOfTag($html, $tag_nm)
            {
                if (!empty($html))
                    {
                        $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'); /* For UTF-8 Compatibility. */
                        $doc = new DOMDocument();
                        $doc->loadHTML($html,LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD|LIBXML_NOWARNING);

                        if (!empty($tag_nm))
                            {
                                if (is_array($tag_nm))
                                    {
                                        $tag_nms = $tag_nm;
                                        unset($tag_nm);

                                        foreach ($tag_nms as $tag_nm)
                                            {
                                                $rmvbl_itms = $doc->getElementsByTagName(strval($tag_nm));
                                                $rmvbl_itms_arr = [];

                                                foreach ($rmvbl_itms as $itm)
                                                    {
                                                        $rmvbl_itms_arr[] = $itm;
                                                    }

                                                foreach ($rmvbl_itms_arr as $itm)
                                                    {
                                                        $itm->parentNode->removeChild($itm);
                                                    }
                                            }
                                    }
                                else if (is_string($tag_nm))
                                    {
                                        $rmvbl_itms = $doc->getElementsByTagName($tag_nm);
                                        $rmvbl_itms_arr = [];

                                        foreach ($rmvbl_itms as $itm)
                                            {
                                                $rmvbl_itms_arr[] = $itm;
                                            }

                                        foreach ($rmvbl_itms_arr as $itm)
                                            {
                                                $itm->parentNode->removeChild($itm); 
                                            }
                                    }
                            }

                        return $doc->saveHTML();
                    }
                else
                    {
                        return '';
                    }
            }
    }

/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [END] */

/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [BEGIN] */

/* Prerequisites: 'removeAllInstancesOfTag(...)' */

if (!function_exists('removeAllScriptTags'))
    {
        function removeAllScriptTags($html)
            {
                return removeAllInstancesOfTag($html, 'script');
            }
    }

/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [END] */


And here is a test usage example:


$html = 'This is a JavaScript retention test.<br><br><span id="chk_frst_scrpt">Congratulations! The first \'script\' tag was successfully removed!</span><br><br><span id="chk_secd_scrpt">Congratulations! The second \'script\' tag was successfully removed!</span><script>document.getElementById("chk_frst_scrpt").innerHTML = "Oops! The first \'script\' tag was NOT removed!";</script><script>document.getElementById("chk_secd_scrpt").innerHTML = "Oops! The second \'script\' tag was NOT removed!";</script>';
echo removeAllScriptTags($html);

I hope my answer really helps someone. Enjoy!

Analphabetic answered 6/5, 2020 at 14:18 Comment(0)
C
3
  • This is a merge of the answers by ClandestineCoder and Binh WPO.

The problem with the angle brackets of the script tag is that they can appear in more than one variant,

e.g. (< = &lt; = &amp;lt;) and (> = &gt; = &amp;gt;),

so instead of creating a pattern array with a bazillion variants, IMHO a better solution would be

return preg_replace('/script.*?\/script/ius', '', $text)
       ? preg_replace('/script.*?\/script/ius', '', $text)
       : $text;

This removes anything that looks like script.../script regardless of the bracket encoding/variant; you can test it here: https://regex101.com/r/lK6vS8/1
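
Because the pattern is not anchored on brackets, it also matches ordinary text that merely contains the word "script" twice with a slash before the second occurrence. The sketch below shows this with made-up image paths; the sample markup is an assumption for illustration only.

```php
<?php
// Made-up markup: two image paths under a "/script/" folder, with content
// in between that has nothing to do with script tags.
$text = '<img src="/script/email/img.jpg"><p>keep me</p>'
      . '<img src="/script/email/img-0.jpg">';

$result = preg_replace('/script.*?\/script/ius', '', $text);

// Fall back to the input if the regex engine reported an error.
$result = $result === null ? $text : $result;

echo $result, "\n";
```

Here the match runs from the first "script" to the second "/script", so the paragraph between the two images is deleted along with parts of both tags.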

Comedo answered 31/7, 2016 at 22:1 Comment(0)
D
3
function remove_script_tags($html){
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $script = $dom->getElementsByTagName('script');

    $remove = [];
    foreach($script as $item){
        $remove[] = $item;
    }

    foreach ($remove as $item){
        $item->parentNode->removeChild($item);
    }

    $html = $dom->saveHTML();
    $html = preg_replace('/<!DOCTYPE.*?<html>.*?<body><p>/ims', '', $html);
    $html = str_replace('</p></body></html>', '', $html);
    return $html;
}

Dejan's answer was good, but saveHTML() adds unnecessary doctype and body tags; this should get rid of them. See https://3v4l.org/82FNP

Disinfest answered 2/1, 2020 at 21:35 Comment(2)
No, it's loadHTML(...) function that adds that. See LIBXML_HTML_NODEFDTD and LIBXML_HTML_NOIMPLIED here: php.net/manual/en/libxml.constants.phpAnalphabetic
ok thanks James for the clarification!Disinfest
C
2

An example modifying ctf0's answer. It runs preg_replace only once, still checks for errors, and also matches the character codes for the forward slash.

$str = '<script> var a - 1; <&#47;script>'; 

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
$replace = preg_replace($pattern, '', $str); 
return ($replace !== null)? $replace : $str;  

If you are using PHP 7, you can use the null coalescing operator to simplify it even more.

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius'; 
return (preg_replace($pattern, '', $str) ?? $str); 
Compartmentalize answered 22/3, 2017 at 19:5 Comment(1)
This does have one downside: if the HTML references files from a script folder, like <img src="/script/email/img.jpg">.. <img src="/script/email/img-0.jpg">, the pattern matches from the first path to the second and deletes everything in between them.Compartmentalize
R
1

I would use BeautifulSoup if it's available. Makes this sort of thing very easy.

Don't try to do it with regexps. That way lies madness.

Russian answered 20/8, 2011 at 10:6 Comment(4)
Why not use regex for this simple operation?Occlude
@webarto See this discussionDumortierite
@Alex, I know that, but why not use it here?Occlude
Because of the answer I've linked to. It's not safe, nor any sort of guarantee. An HTML/XML parser is a far better solution.Dumortierite
D
1

I had been struggling with this question. I discovered you only really need one function. explode('>', $html); The single common denominator to any tag is < and >. Then after that it's usually quotation marks ( " ). You can extract information so easily once you find the common denominator. This is what I came up with:

$html = file_get_contents('http://some_page.html');

$h = explode('>', $html);

foreach($h as $k => $v){

    $v = trim($v);//clean it up a bit

    if(preg_match('/^(<script[.*]*)/ius', $v)){//my regex here might be questionable

        $counter = $k;//match opening tag and start counter for backtrace

        }elseif(preg_match('/([.*]*<\/script$)/ius', $v)){//but it gets the job done

            $script_length = $k - $counter;

            $counter = 0;

            for($i = $script_length; $i >= 0; $i--){
                $h[$k-$i] = '';//backtrace and clear everything in between
                }
            }           
        }
$ht = array();
for($i = 0; $i < count($h); $i++){
    if($h[$i] != ''){
        $ht[$i] = $h[$i];//clean out the blanks so when we implode it works right.
    }
}
$html = implode('>', $ht);//all scripts stripped.


echo $html;

I see this really only working for script tags because you will never have nested script tags. Of course, you can easily add more code that does the same check and gather nested tags.

I call it accordion coding. implode();explode(); are the easiest ways to get your logic flowing if you have a common denominator.

Detoxify answered 15/4, 2013 at 4:27 Comment(1)
You should not use Regex for finding script tags in the HTML code. Use DOMDocument to parse the entire document and find the script tags to removeSchrader
B
1

This is a simplified variant of Dejan Marjanovic's answer:

function removeTags($html, $tag) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $item) {
        $item->parentNode->removeChild($item);
    }
    return $dom->saveHTML();
}

Can be used to remove any kind of tag, including <script>:

$scriptlessHtml = removeTags($html, 'script');
Bolden answered 21/1, 2018 at 0:1 Comment(0)
W
1

Use the str_replace function to replace them with an empty string or something:

$query = '<script>console.log("I should be banned")</script>';

$badChar = array('<script>','</script>');
$query = str_replace($badChar, '', $query);

echo $query;
//this echoes console.log("I should be banned")

Wallen answered 29/6, 2018 at 19:58 Comment(3)
I don't know why people keep arguing over DOMDocument and some kind of regex as the "solution" vs "not the solution". I like this guy's answer -- to simply use php's str_replace (but I'd use str_ireplace due to case-insensitivity). Unless you have a ton of stuff you want to remove, this seems to be the simplest and most effective solution. I tell my users that can't paste or type that kind of stuff. If they do, then tough luck -- it will be removed.Minta
This solution keeps javascript code inside the html string. This is a joke, not a good solution! However, you can go far and remove from "<script" to "</script>". That it could be a nice solution.Mireielle
ireplace "<SCRIPT" with "<!--" and "</SCRIPT>" with "--!>" would be betterAnalisaanalise
