Remove all JavaScript from an HTML page
D

7

12

I've tried using the Sanitize gem to clean a string which contains the HTML of a website.

It only removed the <script> tags, not the JavaScript inside the script tags.

What can I use to remove the JavaScript from a page?

Duplicate answered 28/11, 2011 at 5:18 Comment(1)
Do you also want to remove all on* attributes? - Shoestring
S
13
require 'open-uri'      # included with Ruby; only needed to load HTML from a URL
require 'nokogiri'      # gem install nokogiri   read more at http://nokogiri.org

html = URI.open('http://stackoverflow.com')          # Get the HTML source (plain open no longer fetches URLs in Ruby 3+)
doc = Nokogiri.HTML(html)                            # Parse the document

doc.css('script').remove                             # Remove <script>…</script>
puts doc                                             # Source w/o script blocks

doc.xpath("//@*[starts-with(name(),'on')]").remove   # Remove on____ attributes
puts doc                                             # Source w/o any JavaScript
Shoestring answered 28/11, 2011 at 17:6 Comment(1)
This seems like a really bad idea if your intention is to prevent XSS attacks. There are all sorts of edge cases you're missing. See owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet - Arrear
C
6

I am partial to the Loofah gem. Modified from an example in the docs:

1.9.3p0 :005 > Loofah.fragment("<span onclick='foo'>hello</span> <script>alert('OHAI')</script>").scrub!(:prune).to_s
 => "<span>hello</span> " 

You might be interested in the ActiveRecord extensions Loofah provides.

Cleocleobulus answered 28/11, 2011 at 5:37 Comment(0)
D
6

It turns out that Sanitize has an option built in (just not well documented)...

Sanitize.clean(content, :remove_contents => ['script', 'style'])

This removed all script and style tags (and their content) as I wanted.

Duplicate answered 28/11, 2011 at 21:30 Comment(0)
B
1

So you need to add the sanitize gem to your Gemfile:

gem 'sanitize'

Then run bundle install.

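And then you can call Sanitize.clean(text, remove_contents: ['script', 'style']). As far as I know, Sanitize 3.0 and later renamed this method to Sanitize.fragment, with the same option.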

Borecole answered 20/8, 2014 at 18:20 Comment(0)
I
0

I use this regular expression to get rid of <script> and </script> tags in embedded content. It only makes the tags themselves vanish, so the text between them stays. It also handles added whitespace, such as < script> or < /script >.

post.content = post.content.gsub(/<\s*script\s*>|<\s*\/\s*script\s*>/, '')
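
A quick way to see what this pattern does and does not catch is to run it on a few inputs. This is only a sketch: strip_script_tags and the sample strings are made up for illustration, and the regex is the one from the line above.

```ruby
# Same pattern as above: matches bare <script> and </script> tags,
# allowing extra whitespace, but no attributes and no uppercase.
TAG_ONLY = /<\s*script\s*>|<\s*\/\s*script\s*>/

# Hypothetical helper name, used only for this demonstration.
def strip_script_tags(html)
  html.gsub(TAG_ONLY, '')
end

# The tags vanish, but the JavaScript source is left behind as text.
puts strip_script_tags("<p>Hi</p>< script >alert(1)< /script >")

# An opening tag with attributes is not matched at all.
puts strip_script_tags('<script type="text/javascript">alert(1)</script>')

# The pattern is case-sensitive, so uppercase tags survive too.
puts strip_script_tags("<SCRIPT>alert(1)</SCRIPT>")
```

So this is fine for making stray tags disappear from text, but it should not be relied on to stop scripts from running.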

Innervate answered 28/5, 2016 at 23:50 Comment(0)
A
0

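Remove all script tags and their contents. The quantifier between the tags is lazy ([\s\S]*?), so each script block is matched separately and the markup between two blocks is kept:

html_content = html_content.gsub(/<script.*?>[\s\S]*?<\/script>/i, "")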

source
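
To see why the quantifier between the opening and closing tags matters, here is a small comparison of a greedy and a lazy version of this pattern. The constant and method names and the sample HTML are made up for illustration.

```ruby
# Greedy: [\s\S]* runs from the first <script> to the LAST </script>.
GREEDY_SCRIPT = /<script.*?>[\s\S]*<\/script>/i
# Lazy: [\s\S]*? stops at the first </script> after each opening tag.
LAZY_SCRIPT = /<script.*?>[\s\S]*?<\/script>/i

# Hypothetical helper names, used only for this demonstration.
def remove_scripts_greedy(html)
  html.gsub(GREEDY_SCRIPT, "")
end

def remove_scripts_lazy(html)
  html.gsub(LAZY_SCRIPT, "")
end

sample = "<script>a()</script><p>keep me</p><script>b()</script>"

puts remove_scripts_greedy(sample).inspect  # the paragraph is swallowed too
puts remove_scripts_lazy(sample).inspect    # only the two script blocks go
```

Neither version touches on* attributes or javascript: URLs, so an HTML parser is still the safer tool for untrusted input.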

Australian answered 23/7, 2018 at 19:50 Comment(0)
L
0

Remove all <script> tags and their contents:

regex = /<\s*s\s*c\s*r\s*i\s*p\s*t.*?>.*?<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>|<\s*s\s*c\s*r\s*i\s*p\s*t.*?>|<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>/im
while text =~ regex
  text.gsub!(regex, '')
end

This will even take care of cases like:

<scr<script></script>ipt>alert('hello');</scr</script>ipt>
<script class='blah'  >alert('hello');</script  >

And other tricks. It won't, however, remove JavaScript that is executed via onload= or onclick=.
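
Here is the same loop wrapped in a method so it can be tried on the cases above. This is a sketch: strip_scripts and the sample strings are names I picked for the example, and the regex is unchanged from the answer.

```ruby
SCRIPT_REGEX = /<\s*s\s*c\s*r\s*i\s*p\s*t.*?>.*?<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>|<\s*s\s*c\s*r\s*i\s*p\s*t.*?>|<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>/im

# Hypothetical helper: keep substituting until no match is left, so tags
# that only appear after an inner tag is removed get caught on a later pass.
def strip_scripts(html)
  text = html.dup # work on a copy so frozen strings are fine
  text.gsub!(SCRIPT_REGEX, '') while text =~ SCRIPT_REGEX
  text
end

nested  = "<scr<script></script>ipt>alert('hello');</scr</script>ipt>"
spaced  = "<script class='blah'  >alert('hello');</script  >"
handler = '<body onload="alert(1)"><p>ok</p></body>'

puts strip_scripts(nested).inspect  # first pass rebuilds <script>...</script>, second pass removes it
puts strip_scripts(spaced).inspect
puts strip_scripts(handler)         # unchanged: the onload handler is still there
```

The last call shows the limitation mentioned above: event-handler attributes pass through untouched.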

Lezlielg answered 3/8, 2022 at 13:12 Comment(0)
