How can I strip the HTML from a string in JavaScript?
Using the browser's parser is probably the best bet in current browsers. The following will work, with these caveats:
- Your HTML is valid within a <div> element. HTML contained within <body>, <html> or <head> tags is not valid within a <div> and may therefore not be parsed correctly.
- textContent (the DOM standard property) and innerText (non-standard) are not identical. For example, textContent will include text within a <script> element while innerText will not (in most browsers). This only affects IE <=8, which is the only major browser not to support textContent.
- The HTML does not contain <script> elements.
- The HTML is not null.
- The HTML comes from a trusted source. Using this with arbitrary HTML allows arbitrary untrusted JavaScript to be executed. This example is from a comment by Mike Samuel on the duplicate question:
<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>
Code:
var html = "<p>Some HTML</p>";
var div = document.createElement("div");
div.innerHTML = html;
var text = div.textContent || div.innerText || "";
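Two of the caveats above (null input and untrusted markup) can be softened a little by parsing into a separate document instead of assigning innerHTML on a live element. The following is only a sketch under assumptions: htmlToText is a made-up helper name, and the ParserCtor parameter is my own addition so the function can be exercised where no DOM exists. As far as I know, documents produced by DOMParser do not run scripts or inline event handlers, but check that against your target browsers before relying on it.

```javascript
// htmlToText is a hypothetical helper, not part of any library.
// ParserCtor defaults to the browser's DOMParser; it is a parameter only
// so that a stand-in can be supplied in environments without a DOM.
function htmlToText(html, ParserCtor) {
  // Guard the null/undefined case called out in the caveats.
  if (html === null || html === undefined) {
    return "";
  }
  var Parser = ParserCtor || DOMParser;
  // Parse into a separate document rather than into a live element.
  var doc = new Parser().parseFromString(String(html), "text/html");
  return doc.body.textContent || "";
}

// Browser usage:
// var text = htmlToText("<p>Some HTML</p>");
```

The returned value is plain text. If it is later inserted into a page, it should go in through textContent, not innerHTML, because decoded entities can turn back into markup.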
textContent and innerText. – Flaky
div.innerHTML = html if the value of html is NULL. Worse, it won't properly report the error (instead says parent function has TypeError). Chrome/IE do not crash. – Nutritive
var text = new DOMParser().parseFromString("<p>Some HTML</p>", "text/html").body.textContent || ""; one line version – Mosemoseley
cleanText = strInputCode.replace(/<\/?[^>]+(>|$)/g, "");
Distilled from this website (web.archive).
This regex looks for <, an optional slash /, one or more characters that are not >, then either > or $ (the end of the string).
Examples:
'<div>Hello</div>' ==> 'Hello'
 ^^^^^     ^^^^^^
'Unterminated Tag <b' ==> 'Unterminated Tag '
                  ^^
But it is not bulletproof:
'If you are < 13 you cannot register' ==> 'If you are '
            ^^^^^^^^^^^^^^^^^^^^^^^^
'<div data="score > 42">Hello</div>' ==> ' 42">Hello'
 ^^^^^^^^^^^^^^^^^^          ^^^^^^
If someone is trying to break your application, this regex will not protect you. It should only be used if you already know the format of your input. As other knowledgeable and mostly sane people have pointed out, to safely strip tags, you must use a parser.
If you do not have access to a convenient parser like the DOM, and you cannot trust your input to be in the right format, you may be better off using a package like sanitize-html; other sanitizers are also available.
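To make the regex's behaviour easy to check, here is a small sketch that wraps it in a function and runs the four inputs from the examples above. The names TAG_PATTERN and stripTagsNaive are mine, chosen only for illustration.

```javascript
// Regex from the answer: "<", an optional "/", one or more characters that
// are not ">", then either ">" or the end of the string.
const TAG_PATTERN = /<\/?[^>]+(>|$)/g;

// stripTagsNaive is a hypothetical helper name, not a library function.
// It is only suitable for input whose format is already known.
function stripTagsNaive(input) {
  return String(input).replace(TAG_PATTERN, "");
}

const samples = [
  "<div>Hello</div>",                      // tags removed cleanly
  "Unterminated Tag <b",                   // unterminated tag removed via $
  "If you are < 13 you cannot register",   // bare "<" swallows the rest
  '<div data="score > 42">Hello</div>'     // ">" inside an attribute breaks it
];

samples.forEach(function (s) {
  console.log(JSON.stringify(s) + " ==> " + JSON.stringify(stripTagsNaive(s)));
});
```

The last two inputs show why this should not be treated as a sanitizer: the pattern has no notion of quoted attributes or of a < that is not a tag.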
<img alt="a>b" src="a_b.gif" /> – Clouded
<(?:[^><\"\']*?(?:([\"\']).*?\1)?[^><\'\"]*?)+(?:>|$) ? – Leveroni
<18 as well while <18 is not a tag, it is just a string – Kind
textContent. How many browsers do textContent + innerText cover? BTW, I've edited my answer to include the jQuery way. – Omora
var html = "<p>Hello, <b>World</b>";
var div = document.createElement("div");
div.innerHTML = html;
alert(div.innerText); // Hello, World
That's pretty much the best way of doing it, you're letting the browser do what it does best -- parse HTML.
Edit: As noted in the comments below, this is not the most cross-browser solution. The most cross-browser solution would be to recursively go through all the children of the element and concatenate all text nodes that you find. However, if you're using jQuery, it already does it for you:
alert($("<p>Hello, <b>World</b></p>").text());
Check out the text method.
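If jQuery is not available, the recursive walk mentioned in the edit above could look roughly like this. It is a sketch, not a tested cross-browser implementation: collectText is a made-up name, and the function assumes only that each node exposes nodeType, nodeValue and childNodes, as DOM nodes do.

```javascript
var TEXT_NODE = 3; // numeric value of Node.TEXT_NODE

// collectText is a hypothetical helper: it concatenates the values of all
// text nodes found under `node`, in document order.
function collectText(node) {
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue;
  }
  var text = "";
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    text += collectText(children[i]);
  }
  return text;
}

// Browser usage, mirroring the example above:
// var div = document.createElement("div");
// div.innerHTML = "<p>Hello, <b>World</b>";
// alert(collectText(div)); // Hello, World
```

Like textContent, this includes the text inside <script> and <style> elements, so it shares that caveat.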
innerText. – Flaky
var html = "<b>test</b>"; var text = $("<div/>").html(html).text(); Using $("<div/>") lets you reuse the same element and use less memory for consecutive calls or for loops. – Louvenialouver
var txt = "<p>my line</p><p>my other line</p>some other text"; alert($(txt).text()); where you don't proxy the string within a dom node. 3 lines in, 2 lines out. – Sitka
let text = $(`<div>${html_fragment}</div>`). – Crust
I know this question has an accepted answer, but I feel that it doesn't work in all cases.
For completeness and since I spent too much time on this, here is what we did: we ended up using a function from php.js (which is a pretty nice library for those more familiar with PHP but also doing a little JavaScript every now and then):
http://phpjs.org/functions/strip_tags:535
It seemed to be the only piece of JavaScript code which successfully dealt with all the different kinds of input I stuffed into my application. That is, without breaking it – see my comments about the <script /> tag above.
stripTags('<p onclick="alert(1)">mytext</p>', '<p>') returns <p onclick="alert(1)">mytext</p> – Tesler