I've tried using the Sanitize gem to clean a string which contains the HTML of a website.
It only removed the <script> tags, not the JavaScript inside the script tags.
What can I use to remove the JavaScript from a page?
require 'open-uri' # included with Ruby; only needed to load HTML from a URL
require 'nokogiri' # gem install nokogiri; read more at http://nokogiri.org
html = URI.open('http://stackoverflow.com') # Get the HTML source (Kernel#open no longer accepts URLs in Ruby 3)
doc = Nokogiri::HTML(html) # Parse the document
doc.css('script').remove # Remove <script>…</script>
puts doc # Source w/o script blocks
doc.xpath("//@*[starts-with(name(),'on')]").remove # Remove on____ attributes
puts doc # Source w/o script blocks or on____ event handlers
I am partial to the Loofah gem. Modified from an example in the docs:
1.9.3p0 :005 > Loofah.fragment("<span onclick='foo'>hello</span> <script>alert('OHAI')</script>").scrub!(:prune).to_s
=> "<span>hello</span> "
You might be interested in the ActiveRecord extensions Loofah provides.
It turns out that Sanitize has an option built in (just not well documented)...
Sanitize.clean(content, :remove_contents => ['script', 'style'])
This removed all script and style tags (and their content) as I wanted.
Note that in Sanitize 3.0 and later the method is named Sanitize.fragment.
So you need to add the sanitize gem to your Gemfile:
gem 'sanitize'
Then run bundle install.
And then you can do Sanitize.clean(text, remove_contents: ['script', 'style'])
I use this regular expression to get rid of <script> and </script> tags in embedded content and just make the tags vanish. It also gets rid of variants with added whitespace, such as < script> or < /script >.
post.content = post.content.gsub(/<\s*script\s*>|<\s*\/\s*script\s*>/, '')
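To make the scope of that substitution concrete, here is a small sketch run on two made-up strings. The helper name strip_script_tags and the sample inputs are assumptions for illustration. Because the pattern matches only bare tags, the JavaScript between them is left behind as plain text, and an opening tag that carries attributes is not matched at all.

```ruby
# Hypothetical helper wrapping the substitution shown above.
# It matches only bare <script> and </script> tags, allowing
# extra whitespace inside the angle brackets.
def strip_script_tags(text)
  text.gsub(/<\s*script\s*>|<\s*\/\s*script\s*>/, '')
end

# The tags vanish, but the code between them stays as text.
puts strip_script_tags("<p>Hi</p>< script>alert(1)< /script >")
# prints: <p>Hi</p>alert(1)

# The opening tag has an attribute, so only </script> is removed.
puts strip_script_tags('<script type="text/javascript">alert(2)</script>')
# prints: <script type="text/javascript">alert(2)
```

The second case leaves an unclosed opening tag in the output, which is why this pattern is only suitable when the leftover script text is acceptable and the input is known to use bare tags.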
Remove all <script> tags and their contents:
regex = /<\s*s\s*c\s*r\s*i\s*p\s*t.*?>.*?<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>|<\s*s\s*c\s*r\s*i\s*p\s*t.*?>|<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>/im
while text =~ regex
text.gsub!(regex, '')
end
This will even take care of cases like:
<scr<script></script>ipt>alert('hello');</scr</script>ipt>
<script class='blah' >alert('hello');</script >
And other tricks. It won't, however, remove JavaScript that is executed via onload= or onclick=.
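The loop above can be wrapped in a method to check both claims: that nested tags are removed once the substitution is repeated, and that event-handler attributes are left untouched. This is a sketch only. The method name remove_script_blocks and the onclick sample are assumptions for illustration.

```ruby
# Hypothetical helper: repeat the substitution until nothing matches,
# so tags that reassemble after one pass are caught on the next.
def remove_script_blocks(text)
  regex = /<\s*s\s*c\s*r\s*i\s*p\s*t.*?>.*?<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>|<\s*s\s*c\s*r\s*i\s*p\s*t.*?>|<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>/im
  result = text.dup # work on a copy so the caller's string is untouched
  result.gsub!(regex, '') while result =~ regex
  result
end

# Pass 1 removes the inner tags and leaves <script>alert('hello');</script>.
# Pass 2 removes that block as well.
p remove_script_blocks("<scr<script></script>ipt>alert('hello');</scr</script>ipt>")
# prints: ""

# Inline event handlers do not match the pattern and come back unchanged.
puts remove_script_blocks(%q{<a onclick="alert('hi')">link</a>})
# prints: <a onclick="alert('hi')">link</a>
```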