How to strip HTML tags from string in JavaScript? [duplicate]

Asked 15/2, 2011 at 9:56 Answered 7/9, 2012 at 16:14

319

How can I strip the HTML from a string in JavaScript?

Clouded answered 15/2, 2011 at 9:56 Comment(0)

320

Using the browser's parser is probably the best bet in current browsers. The code below will work, subject to these caveats:

Your HTML is valid within a <div> element. HTML contained within <body> or <html> or <head> tags is not valid within a <div> and may therefore not be parsed correctly.
textContent (the DOM standard property) and innerText (non-standard) properties are not identical. For example, textContent will include text within a <script> element while innerText will not (in most browsers). This only affects IE <=8, which is the only major browser not to support textContent.
The HTML does not contain <script> elements.
The HTML is not null.
The HTML comes from a trusted source. Using this with arbitrary HTML allows arbitrary untrusted JavaScript to be executed. This example is from a comment by Mike Samuel on the duplicate question: <img onerror='alert("could run arbitrary JS here")' src=bogus>

Code:

var html = "<p>Some HTML</p>";
var div = document.createElement("div");
div.innerHTML = html;
var text = div.textContent || div.innerText || "";
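The null and trust caveats above can be softened a little with DOMParser, which a later comment also suggests. This is a minimal sketch, not a sanitizer: the helper name htmlToText and its optional parser argument are assumptions made for illustration. To my understanding, documents produced by DOMParser have scripting disabled, so inline handlers such as onerror should not fire while parsing, but verify that in the browsers you target.

```javascript
// Hypothetical helper: extract text from an HTML string via DOMParser.
// `parser` is optional and exists only so a parser instance can be reused
// or substituted; in a browser it falls back to `new DOMParser()`.
function htmlToText(html, parser) {
  // Guard against null/undefined input before touching the parser.
  if (html == null) {
    return "";
  }
  var p = parser || new DOMParser();
  // "text/html" makes the parser build a full document with a <body>.
  var doc = p.parseFromString(String(html), "text/html");
  return (doc.body && doc.body.textContent) || "";
}
```

In a browser, htmlToText("<p>Some HTML</p>") is expected to give "Some HTML". Like the original snippet, this depends on a DOM being available, so it does not run as-is on a plain Node server.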

Flaky answered 15/2, 2011 at 10:40 Comment(14)

Nice answer, I didn't know about textContent. How many browsers do textContent + innerText cover? BTW, I've edited my answer to include the jQuery way. – Omora 15/2, 2011 at 11:4

@Felix: All major browsers have at least one of textContent and innerText. – Flaky 15/2, 2011 at 11:5

Doesn't work when the string contains something like <script>alert('hi');</script>. Then it crashes with "illegal token at" etc.. – Sharolynsharon 19/8, 2012 at 1:4

Good caveats. In case it is not already clear I wanted to add that Firefox will crash on div.innerHTML = html if the value of html is NULL. Worse, it won't properly report the error (instead says parent function has TypeError). Chrome/IE do not crash. – Nutritive 24/1, 2013 at 22:20

SECURITY ISSUE ... This could be vulnerable as you're setting div.innerHTML ... I'm sure you don't want some unwanted script executed. ... manual cleanup would be cool. – Basion 9/8, 2016 at 11:44

@alee-sindhu: I think the caveats cover that. – Flaky 15/8, 2016 at 15:27

Elegant solution, but isn't universal. It doesn't work if you use it on node server because of the document dependency – Nollie 14/4, 2017 at 10:17

using React Native 0.51, can't use this solution – Emplane 11/1, 2018 at 10:34

<p>test</p><p>test</p> gives testtest, should have a space or newline between – Kimura 20/1, 2020 at 10:36

@eomeroff: That isn't what this question is asking for. – Flaky 20/1, 2020 at 10:42

Literally it is not. But having contents of two paragraphs becoming one word makes no sense. – Kimura 20/1, 2020 at 10:47

@eomeroff: Whether it makes sense depends on the context. What input do you want to accept and what do you require the output to be? – Flaky 20/1, 2020 at 15:5

@TimDown the one that closely corresponds to what is rendered with two <p> tags. For example, you take content from a rich text editor, remove the tags and paste it into Notepad. It should have the same spaces and/or line breaks. – Kimura 20/1, 2020 at 15:25

var text = new DOMParser().parseFromString("<p>Some HTML</p>", "text/html").body.textContent || ""; one line version – Mosemoseley 25/3 at 1:0
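The spacing concern raised in the comments above (two paragraphs collapsing into "testtest") could be handled roughly as follows. This is only a sketch under stated assumptions: the function name htmlToTextWithBreaks and the short list of block-level tags are illustrative choices, and because it relies on regular expressions it is only suitable for trusted input in a known format.

```javascript
// Hypothetical helper: strip tags but keep a newline where a block ends.
function htmlToTextWithBreaks(html) {
  var input = html == null ? "" : String(html);
  return input
    // Turn <br>, <br/> and <br /> into newlines.
    .replace(/<br\s*\/?>/gi, "\n")
    // Turn the closing tag of a few common block elements into a newline.
    .replace(/<\/(p|div|li|h[1-6]|tr)\s*>/gi, "\n")
    // Remove whatever tags remain.
    .replace(/<\/?[^>]+(>|$)/g, "")
    // Collapse runs of newlines and trim the ends.
    .replace(/\n{2,}/g, "\n")
    .trim();
}
```

With this sketch, "<p>test</p><p>test</p>" becomes "test" and "test" on separate lines rather than "testtest". Whether a newline, a space, or nothing is the right separator depends on what the output is used for.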

508

cleanText = strInputCode.replace(/<\/?[^>]+(>|$)/g, "");

Distilled from this website (web.archive).

This regex looks for <, an optional slash /, one or more characters that are not >, then either > or $ (the end of the string, since the m flag is not set).

Examples:

'<div>Hello</div>' ==> 'Hello'
 ^^^^^     ^^^^^^
'Unterminated Tag <b' ==> 'Unterminated Tag '
                  ^^

But it is not bulletproof:

'If you are < 13 you cannot register' ==> 'If you are '
            ^^^^^^^^^^^^^^^^^^^^^^^^
'<div data="score > 42">Hello</div>' ==> ' 42">Hello'
 ^^^^^^^^^^^^^^^^^^          ^^^^^^

If someone is trying to break your application, this regex will not protect you. It should only be used if you already know the format of your input. As other knowledgeable and mostly sane people have pointed out, to safely strip tags, you must use a parser.

If you do not have access to a convenient parser like the DOM, and you cannot trust your input to be in the right format, you may be better off using a package like sanitize-html; other sanitizers are also available.
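To illustrate why the failing cases above need more than a single regex, here is a minimal sketch of a character scanner that tracks whether it is inside a quoted attribute value. The function name stripTags and the rule that "<" only opens a tag when followed by a letter, "/", "!" or "?" are assumptions made for this example. It is not a sanitizer and not a full HTML parser: it does not handle comments that contain ">", and it keeps the text inside <script> and <style> elements.

```javascript
// Hypothetical helper: remove tags while respecting quoted attribute values.
function stripTags(input) {
  var text = input == null ? "" : String(input);
  var out = "";
  var i = 0;
  while (i < text.length) {
    var ch = text[i];
    // Only treat "<" as a tag opener when a tag-like character follows,
    // so "< 13" and "<18" are left alone.
    if (ch === "<" && /[a-zA-Z\/!?]/.test(text[i + 1] || "")) {
      var quote = null; // the quote character we are currently inside, if any
      i++;
      while (i < text.length) {
        var c = text[i];
        if (quote) {
          if (c === quote) quote = null; // closing quote
        } else if (c === '"' || c === "'") {
          quote = c; // opening quote
        } else if (c === ">") {
          break; // end of tag, outside any quoted value
        }
        i++;
      }
      i++; // step past ">" (or past the end for an unterminated tag)
    } else {
      out += ch;
      i++;
    }
  }
  return out;
}

console.log(stripTags('<div data="score > 42">Hello</div>'));  // Hello
console.log(stripTags("If you are < 13 you cannot register")); // unchanged
```

The trade-off is the usual one: this handles the specific cases shown in this answer, but hostile or unusual markup still calls for a real parser or a dedicated sanitizer.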

Terrorist answered 15/2, 2011 at 10:1 Comment(12)

Sorry, but that would break <img alt="a>b" src="a_b.gif" /> – Clouded 15/2, 2011 at 10:56

@Clouded people who make a hobby out of breaking the ill-use of regular expressions for parsing general HTML are great. It is a great hobby. – Epaminondas 7/5, 2013 at 18:39

@Ziggy: That sounds an awful lot like sarcasm... – Clouded 7/5, 2013 at 22:31

@Clouded no! Really! Every time I read one of these comment threads I get a little thrill. "Ho ho ho," I think "<img alt="a>b" src="a_b.gif" />, so clever!" – Epaminondas 8/5, 2013 at 5:28

@Clouded That would be buggy HTML; it had to be <img alt="a&gt;b" . – Lyudmila 26/1, 2015 at 11:55

Not in HTML5, the syntax can be different. @Lyudmila – Dotson 14/3, 2015 at 19:16

using reg is not good approach #1732848 – Carpal 1/6, 2016 at 17:39

Could we improve it with <(?:[^><\"\']*?(?:([\"\']).*?\1)?[^><\'\"]*?)+(?:>|$) ? – Leveroni 17/7, 2017 at 7:30

This code will remove <18 as well, while <18 is not a tag, it is just a string – Kind 1/4, 2020 at 22:44

Could somebody please explain what this regex eliminates and what it keeps? It works great for my needs, so I need to understand it. – Morphinism 30/7, 2020 at 17:41

@Morphinism explained what this regex does in the answer body. – Terrorist 5/8, 2020 at 16:34

Using regexp is a perfectly fine approach if I control everything. Which I sometimes do. – Female 22/2, 2022 at 19:18
