Remove HTML tags from a String
K

35

490

Is there a good way to remove HTML from a Java string? A simple regex like

replaceAll("\\<.*?>", "") 

will work, but some things like &amp; won't be converted correctly and non-HTML between the two angle brackets will be removed (i.e. the .*? in the regex will disappear).
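Both shortcomings can be reproduced with a small, self-contained sketch. The class name, method name, and sample inputs below are hypothetical and only serve as illustration.

```java
// Hypothetical demo class; not part of the original question.
final class NaiveStripDemo {

    // The regex from the question: drop everything from '<' up to the next '>'.
    static String naiveStrip(String html) {
        return html.replaceAll("\\<.*?>", "");
    }

    public static void main(String[] args) {
        // The tags are removed, but the entity is left as "&amp;" instead of "&".
        System.out.println(naiveStrip("<b>Fish &amp; chips</b>"));
        // Plain text that merely sits between angle brackets is removed as well.
        System.out.println(naiveStrip("if a < b and c > d then stop"));
    }
}
```

The first call leaves the entity undecoded, and the second one loses the text between the two comparison operators.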

Klee answered 27/10, 2008 at 16:39 Comment(3)
use this with following guide : compile 'org.jsoup:jsoup:1.9.2'Bellini
https://mcmap.net/q/67458/-remove-html-tags-from-a-stringBellini
See also: https://mcmap.net/q/67458/-remove-html-tags-from-a-stringRodin
M
657

Use an HTML parser instead of regex. This is dead simple with Jsoup.

public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only e.g. <b>, <i> and <u>.

Montagu answered 30/6, 2010 at 13:24 Comment(22)
Jsoup is nice, but I encountered some drawbacks with it. I use it to get rid of XSS, so basically I expect a plain text input, but some evil person could try to send me some HTML. Using Jsoup, I can remove all HTML but, unfortunately, it also shrinks many spaces to one and removes line breaks (\n characters)Cay
@Ridcully: for that you'd like to use Jsoup#clean() instead.Montagu
using clean() will still cause extra spaces and \n chars to be removed. ex: Jsoup.clean("a \n b", Whitelist.none()) returns "a b"Jaye
@Keith: of course extra spaces and \n will be removed as HTML ignores them and you are calling .clean()Beghtol
Will this perform 'attribute escape' as well? I am specifically referring to Rule #2 in this list: owasp.org/index.php/…Centrepiece
@Nels: you're talking about Jsoup#clean()? Yes definitely. Click the "Jsoup#clean()" link in my previous comment.Montagu
Alas, it removed new lines. And we need them :)Blocky
input.replaceAll("<[^>]*>", "");Covey
@Zeroows: this fails miserably on <p>Lorem ipsum 1 < 3 dolor sit amet</p>. Again, HTML is not a regular language. It's completely beyond me why everyone keeps trying to throw regex on it to parse parts of interest instead of using a real parser.Montagu
I also found this answer not satisfying against XSS and I posted another answer.Kabuki
That is WAY better than Html.fromHtmlSwimmingly
use Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false)); to preserve linebreaksSpan
compile this : compile 'org.jsoup:jsoup:1.9.2'Bellini
It may be better, but it introduces another dependency to your project which is not goodKeratitis
Didn't work for the input text, "Hi <span>there! Please don't use <and >"Glower
Can someone elaborate what makes this better than Html.fromHtml(..).toString()?Hydrolyse
@FrankKrumnow: Html.fromHtml(..).toString() isn't available in standard Java nor as a standard Java library. It's only available in Android.Montagu
My bad. Did see the question in an android context.Hydrolyse
I've just realized that this answer implies HTML string. What about if we're talking about general string? I.e. there's no way to turn off escaping lt, gt, amp, quot. If string contains &, JSoup will clean it as &amp. (tried to use it for JSON string sanitation, to prevent XSS)Algorithm
@jalmasi: just traverse JSON and sanitize each string property instead of the whole JSON object itself. Or, if your JSON parser supports it, register a new listener/adapter/whateverTheyCallIt so that your JSON parser does it automatically.Montagu
@Montagu that's exactly what I'm doing - jackson, de/serializer for by type. But Jsoup cleaner replaces gt, lt, amp, quot no matter what, and there's no way to turn that off. I.e. there's no EscapeMode.none :)Algorithm
This is not working for me, can you please check #73862239Overtake
K
304

If you're writing for Android you can do this...

androidx.core.text.HtmlCompat.fromHtml(instruction,HtmlCompat.FROM_HTML_MODE_LEGACY).toString()

Klipspringer answered 17/6, 2011 at 12:48 Comment(5)
Awesome tip. :) If you're displaying the text in a TextView, you can drop the .toString() to preserve some formatting, too.Frontier
@Branky It doesn't I have tried...the accepted answer works like charmChomp
This is good, but <img> tags are replaced with some bizarre things. I got small squares where there was an imageMarsala
@BibaswannBandyopadhyay another answer helps getting rid of these charactersPhospholipide
use package androidx.core.text instead of legacy android.textPriebe
M
98

If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-thans, and html-encode ampersands (and optionally quotes) and you're fine. A modification to your code to implement the second option would be:

replaceAll("\\<[^>]*>","")

but you will run into issues if the user enters something malformed, like <bhey!</b>.

You can also check out JTidy which will parse "dirty" html input, and should give you a way to remove the tags, keeping the text.

The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to make sure to encode any remaining HTML special characters to keep your output safe.
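As a rough sketch of that final encoding step (and of the "display `<b>hey!</b>` literally" option), the five characters that are special in HTML text and attribute values can be escaped by hand. The class and method names are hypothetical; in a real project a maintained library escaper is probably the safer choice.

```java
// Hypothetical helper; a minimal sketch, not a complete sanitizer.
final class HtmlEscapeSketch {

    // Replace each HTML special character with its entity.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<b>hey!</b>"));
    }
}
```

Because every `&` in the input is escaped exactly once in a single pass, already-escaped input such as `&lt;` comes out as `&amp;lt;` and is shown literally by the browser.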

Matheny answered 27/10, 2008 at 17:0 Comment(2)
You also run into issues, if there is unescaped < or > sign inside the html node content. <span>My age is < a lot's of text > then your age</span>. i think that only 100% way to do this is via some XML DOM interface (like SAX or similar), to use node.getText().Tymes
This does not work for strings like "\r\n HDFC Bank <\/a>\r\n <\/div>\r\n <\/td>\r\n"Overtake
A
31

Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
    StringBuffer s;

    public Html2Text() {
    }

    public void parse(Reader in) throws IOException {
        s = new StringBuffer();
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleText(char[] text, int pos) {
        s.append(text);
    }

    public String getText() {
        return s.toString();
    }

    public static void main(String[] args) {
        try {
            // the HTML to convert
            FileReader in = new FileReader("java-new.html");
            Html2Text parser = new Html2Text();
            parser.parse(in);
            in.close();
            System.out.println(parser.getText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Agenda answered 18/1, 2009 at 14:16 Comment(4)
The result of "a < b or b > c" is "a b or b > c", which seems unfortunate.Upandcoming
This worked the best for me. I needed to preserve line breaks. I did by adding this simple method to the parser: @Override public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) { if (t == HTML.Tag.P || t == HTML.Tag.BR) { s.append('\n'); } }Auxin
dfrankow: The mathematical expression a < b or b > c should be written in html like this: a &lt; b or b &gt; cAuxin
I love that this doesn't have external dependencies.Regurgitation
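The line-break idea from the comment above can be combined with the answer into one small class that works on a String instead of a file. This is only a sketch under a few assumptions: the class name is hypothetical, and it assumes the Swing parser reports empty elements such as `<br>` through `handleSimpleTag` rather than `handleStartTag`, which is why both callbacks are overridden.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical variant of Html2Text that inserts '\n' for <p> and <br>.
final class Html2TextWithBreaks extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    static String convert(String html) {
        Html2TextWithBreaks callback = new Html2TextWithBreaks();
        try {
            // The third argument is true to ignore any charset directive.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return callback.text.toString();
    }

    @Override
    public void handleText(char[] data, int pos) {
        text.append(data);
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.P) {
            text.append('\n');
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.BR) {
            text.append('\n');
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("<p>Line one<br>Line two</p><p>Line three</p>"));
    }
}
```

Like the original, this has no external dependencies, and it shares the original's behaviour on unescaped `<` in text.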
H
28

I think that the simplest way to filter the HTML tags is:

private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

public static String removeTags(String string) {
    if (string == null || string.length() == 0) {
        return string;
    }

    Matcher m = REMOVE_TAGS.matcher(string);
    return m.replaceAll("");
}
Hassi answered 4/11, 2010 at 10:13 Comment(1)
Two things missing: special character decoding, tags p and br; optionally you want to omit script content.Dwelling
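The points raised in that comment could be addressed roughly as follows. This is a sketch, not a robust converter: the class name is hypothetical, only a handful of entities are decoded, and it is still regex-based, so it inherits the usual problems with unescaped `<` in text.

```java
import java.util.regex.Pattern;

// Hypothetical helper that extends the regex approach from the answer.
final class RegexStripSketch {
    // Whole <script> elements, including their content.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");
    // <br>, <br/> and </p> become line breaks.
    private static final Pattern BREAKS =
            Pattern.compile("(?i)<br\\s*/?>|</p\\s*>");
    // Any remaining tag.
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");

    static String toText(String html) {
        String s = SCRIPT.matcher(html).replaceAll("");
        s = BREAKS.matcher(s).replaceAll("\n");
        s = TAGS.matcher(s).replaceAll("");
        // Decode a few common entities; "&amp;" goes last to avoid double decoding.
        return s.replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(toText("<p>a &lt; b</p><script>alert(1)</script>c<br/>d"));
    }
}
```

Decoding happens after the tags are stripped, so an escaped `&lt;b&gt;` in the input ends up as the literal text `<b>` rather than being removed.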
C
20

On Android, try this:

String result = Html.fromHtml(html).toString();
Cocteau answered 4/5, 2015 at 4:29 Comment(10)
This did it! it removed all inline html from text :)Capita
You are always using code snippets for normal code. Code Snippets are only supposed to be used for HTML or javascript or other code which can be run in the browser. You cannot run Java in the browser. Use normal code blocks in the future... I will edit your answer for you this time and fix the formatting etc, but please don't do this anymore in the future. This isn't the first time I told you about this...Arrowhead
What library is this coming from?Hettie
@PaulCroarkin this is the library inside android sdk . android.text.HtmlCocteau
But this is adding plenty of white spaces or new lines to the end of my string. Not cool.Selway
works fine for me man, may be you should check your input, like does it comes with any white space like that..Cocteau
Awesome. Removed all html tags.Innuendo
looks familiar, like my answer from 2011.Klipspringer
that removed another headache from my plate :)Chainplate
Helped a lot. This works for me. Thank you.Infant
A
19

Also very simple using Jericho, and you can retain some of the formatting (line breaks and links, for example).

    Source htmlSource = new Source(htmlText);
    Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
    Renderer htmlRend = new Renderer(htmlSeg);
    System.out.println(htmlRend.toString());
Absently answered 5/8, 2011 at 21:11 Comment(4)
Jericho was able to parse <br> to a line break. Jsoup and HTMLEditorKit could not do that.Adne
Jericho is very capable of doing this job, used it a lot in owned projects.Velasco
Jericho worked like a charm. Thanks for the suggestion. One note: you don't have to create a Segment of the whole string. Source extends Segment, so either works in the Renderer constructor.Shirk
Jericho now seems to be a bit dated (the last release was 3.4 in late 2015). However, if it still works well, then it still works well!Dissimulation
K
18

The accepted answer of doing simply Jsoup.parse(html).text() has 2 potential issues (with JSoup 1.7.3):

  • It removes line breaks from the text
  • It converts text &lt;script&gt; into <script>

If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:

// breaks multi-level of escaping, preventing &amp;lt;script&amp;gt; to be rendered as <script>
String replace = input.replace("&amp;", "");
// decode any encoded html, preventing &lt;script&gt; to be rendered as <script>
String html = StringEscapeUtils.unescapeHtml(replace);
// remove all html tags, but maintain line breaks
String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// decode html again to convert character entities back into text
return StringEscapeUtils.unescapeHtml(clean);

Note that the last step is because I need to use the output as plain text. If you need only HTML output then you should be able to remove it.

And here is a bunch of test cases (input to output):

{"regular string", "regular string"},
{"<a href=\"link\">A link</a>", "A link"},
{"<script src=\"http://evil.url.com\"/>", ""},
{"&lt;script&gt;", ""},
{"&amp;lt;script&amp;gt;", "lt;scriptgt;"}, // best effort
{"\" ' > < \n \\ é å à ü and & preserved", "\" ' > < \n \\ é å à ü and & preserved"}

If you find a way to make it better, please let me know.

Kabuki answered 13/5, 2014 at 4:12 Comment(1)
This will fail against something like &#38;lt;script&#38;gt;alert('Evil script executed');&#38;lt;/script&#38;gt;. Same goes for &#x26;. JSoup does not convert &lt;script&gt; into <script>, it does that because you call StringEscapeUtils.unescapeHtml after JSoup cleaned up the input.Ginsburg
F
12

HTML Escaping is really hard to do right- I'd definitely suggest using library code to do this, as it's a lot more subtle than you'd think. Check out Apache's StringEscapeUtils for a pretty good library for handling this in Java.

Fairing answered 27/10, 2008 at 17:3 Comment(6)
This is the sort of thing I'm looking for but I want to strip the HTML instead of escaping it.Klee
do you want to strip the html, or do you want to convert it to plain text? Stripping the HTML from a long string with br tags and HTML entities can result in an illegible mess.Fairing
Have you tried StringEscapeUtils.unescapeHtml? from commons-lang?Fastback
StringEscapeUtils.unescapeHtml does not strip htmlPadegs
Good information on utils to use for unescaping but not answering the question.Troyes
Confusing answer. Removing != UnescapingCupbearer
L
12

This should work -

use this

  text.replaceAll("<.*?>", " ") -> This replaces every HTML tag with a space.

and this

  text.replaceAll("&.*?;", "") -> This removes every entity that starts with "&" and ends with ";", such as &nbsp;, &amp;, &gt;, etc.
Lalise answered 30/6, 2017 at 11:42 Comment(2)
Generally, answers are much more useful if they include an explanation of what the code is intended to do.Knapsack
@Lalise No explanation of the answer at all? Not good.Accroach
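One caveat with the second pattern: `&.*?;` also matches ordinary text that happens to contain an ampersand followed later by a semicolon. The sketch below contrasts it with a narrower pattern that only matches named, decimal, or hexadecimal character references. The class and method names are hypothetical.

```java
// Hypothetical comparison of a broad and a narrow entity pattern.
final class EntityPatternSketch {

    // Broad pattern from the answer: anything from '&' to the next ';'.
    static String stripBroad(String text) {
        return text.replaceAll("&.*?;", "");
    }

    // Narrower pattern: &name; or &#123; or &#x1F; only.
    static String stripEntities(String text) {
        return text.replaceAll(
                "&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);", "");
    }

    public static void main(String[] args) {
        String input = "Tom & Jerry; then&nbsp;more";
        System.out.println(stripBroad(input));
        System.out.println(stripEntities(input));
    }
}
```

With this input the broad pattern also removes "& Jerry;", while the narrow one removes only `&nbsp;`.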
G
9

You can simply use the Android's default HTML filter

    public String htmlToStringFilter(String textToFilter){

    return Html.fromHtml(textToFilter).toString();

    }

The above method will return the HTML filtered string for your input.

Garlen answered 29/3, 2019 at 8:37 Comment(1)
Cannot resolve symbol 'Html'Capers
R
7

You might want to replace <br/> and </p> tags with newlines before stripping the HTML to prevent it becoming an illegible mess as Tim suggests.

The only way I can think of removing HTML tags but leaving non-HTML between angle brackets would be check against a list of HTML tags. Something along these lines...

replaceAll("\\<\\s*tag[^>]*>","")

Then HTML-decode special characters such as &amp;. The result should not be considered to be sanitized.
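A sketch of the "check against a list of HTML tags" idea might look like this. The tag list is hypothetical and deliberately short, and unlike the pattern above it does not allow whitespace after `<`, because otherwise plain text such as `a < b` would be mistaken for a `<b>` tag.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips only tags whose names appear in a fixed list.
final class KnownTagStripSketch {
    // Opening or closing tags for a small, illustrative set of names.
    private static final Pattern KNOWN_TAGS = Pattern.compile(
            "(?i)</?(?:a|b|br|div|i|p|span|u)\\b[^>]*>");

    static String strip(String html) {
        return KNOWN_TAGS.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>x < y and <b>y</b> > z</p>"));
    }
}
```

Text such as `1 <b and c> 2` would still be treated as a tag, so, as noted above, the result should not be considered sanitized.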

Rotunda answered 27/10, 2008 at 23:52 Comment(0)
G
5

Here's a slightly more fleshed out update to try to handle some formatting for breaks and lists. I used Amaya's output as a guide.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Stack;
import java.util.logging.Logger;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HTML2Text extends HTMLEditorKit.ParserCallback {
    private static final Logger log = Logger
            .getLogger(Logger.GLOBAL_LOGGER_NAME);

    private StringBuffer stringBuffer;

    private Stack<IndexType> indentStack;

    public static class IndexType {
        public String type;
        public int counter; // used for ordered lists

        public IndexType(String type) {
            this.type = type;
            counter = 0;
        }
    }

    public HTML2Text() {
        stringBuffer = new StringBuffer();
        indentStack = new Stack<IndexType>();
    }

    public static String convert(String html) {
        HTML2Text parser = new HTML2Text();
        Reader in = new StringReader(html);
        try {
            // the HTML to convert
            parser.parse(in);
        } catch (Exception e) {
            log.severe(e.getMessage());
        } finally {
            try {
                in.close();
            } catch (IOException ioe) {
                // this should never happen
            }
        }
        return parser.getText();
    }

    public void parse(Reader in) throws IOException {
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("StartTag:" + t.toString());
        if (t.toString().equals("p")) {
            if (stringBuffer.length() > 0
                    && !stringBuffer.substring(stringBuffer.length() - 1)
                            .equals("\n")) {
                newLine();
            }
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.push(new IndexType("ol"));
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.push(new IndexType("ul"));
            newLine();
        } else if (t.toString().equals("li")) {
            IndexType parent = indentStack.peek();
            if (parent.type.equals("ol")) {
                String numberString = "" + (++parent.counter) + ".";
                stringBuffer.append(numberString);
                for (int i = 0; i < (4 - numberString.length()); i++) {
                    stringBuffer.append(" ");
                }
            } else {
                stringBuffer.append("*   ");
            }
            indentStack.push(new IndexType("li"));
        } else if (t.toString().equals("dl")) {
            newLine();
        } else if (t.toString().equals("dt")) {
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.push(new IndexType("dd"));
            newLine();
        }
    }

    private void newLine() {
        stringBuffer.append("\n");
        for (int i = 0; i < indentStack.size(); i++) {
            stringBuffer.append("    ");
        }
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        log.info("EndTag:" + t.toString());
        if (t.toString().equals("p")) {
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("li")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.pop();
        }
    }

    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("SimpleTag:" + t.toString());
        if (t.toString().equals("br")) {
            newLine();
        }
    }

    public void handleText(char[] text, int pos) {
        log.info("Text:" + new String(text));
        stringBuffer.append(text);
    }

    public String getText() {
        return stringBuffer.toString();
    }

    public static void main(String args[]) {
        String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?<p>this is a<br />mutiline paragraph</p><ol>  <li>This</li>  <li>is</li>  <li>an</li>  <li>ordered</li>  <li>list    <p>with</p>    <ul>      <li>another</li>      <li>list        <dl>          <dt>This</dt>          <dt>is</dt>            <dd>sdasd</dd>            <dd>sdasda</dd>            <dd>asda              <p>aasdas</p>            </dd>            <dd>sdada</dd>          <dt>fsdfsdfsd</dt>        </dl>        <dl>          <dt>vbcvcvbcvb</dt>          <dt>cvbcvbc</dt>            <dd>vbcbcvbcvb</dd>          <dt>cvbcv</dt>          <dt></dt>        </dl>        <dl>          <dt></dt>        </dl></li>      <li>cool</li>    </ul>    <p>stuff</p>  </li>  <li>cool</li></ol><p></p></body></html>";
        System.out.println(convert(html));
    }
}
Gerdi answered 23/4, 2010 at 21:22 Comment(0)
C
5

One more way can be to use com.google.gdata.util.common.html.HtmlToText class like

MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));

This is not bullet proof code though and when I run it on wikipedia entries I am getting style info also. However I believe for small/simple jobs this would be effective.

Creamer answered 6/8, 2010 at 18:23 Comment(0)
U
5

The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".

So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):

import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * Take HTML and give back the text part while dropping the HTML tags.
 *
 * There is some risk that using TagSoup means we'll permute non-HTML text.
 * However, it seems to work the best so far in test cases.
 *
 * @author dan
 * @see <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a> 
 */
public class Html2Text2 implements ContentHandler {
private StringBuffer sb;

public Html2Text2() {
}

public void parse(String str) throws IOException, SAXException {
    XMLReader reader = new Parser();
    reader.setContentHandler(this);
    sb = new StringBuffer();
    reader.parse(new InputSource(new StringReader(str)));
}

public String getText() {
    return sb.toString();
}

@Override
public void characters(char[] ch, int start, int length)
    throws SAXException {
    for (int idx = 0; idx < length; idx++) {
    sb.append(ch[idx+start]);
    }
}

@Override
public void ignorableWhitespace(char[] ch, int start, int length)
    throws SAXException {
    // Append only the reported range, not the whole buffer.
    sb.append(ch, start, length);
}

// The methods below do not contribute to the text
@Override
public void endDocument() throws SAXException {
}

@Override
public void endElement(String uri, String localName, String qName)
    throws SAXException {
}

@Override
public void endPrefixMapping(String prefix) throws SAXException {
}


@Override
public void processingInstruction(String target, String data)
    throws SAXException {
}

@Override
public void setDocumentLocator(Locator locator) {
}

@Override
public void skippedEntity(String name) throws SAXException {
}

@Override
public void startDocument() throws SAXException {
}

@Override
public void startElement(String uri, String localName, String qName,
    Attributes atts) throws SAXException {
}

@Override
public void startPrefixMapping(String prefix, String uri)
    throws SAXException {
}
}
Upandcoming answered 12/8, 2010 at 23:24 Comment(0)
R
5

Alternatively, one can use HtmlCleaner:

private CharSequence removeHtmlFrom(String html) {
    return new HtmlCleaner().clean(html).getText();
}
Rodin answered 17/2, 2014 at 20:19 Comment(1)
HtmlCleaner works well, keeps line breaks and has a recent release (2.21 in May 2017).Dissimulation
L
5

Use Html.fromHtml

HTML Tags are

<a href="…">, <b>, <big>, <blockquote>, <br>, <cite>, <dfn>,
<div align="…">, <em>, <font size="…" color="…" face="…">,
<h1>, <h2>, <h3>, <h4>, <h5>, <h6>,
<i>, <p>, <small>,
<strike>, <strong>, <sub>, <sup>, <tt>, <u>

According to Android's official documentation, any <img> tags in the HTML are displayed as a generic replacement image, which your program can then go through and replace with real images.

An overload of Html.fromHtml takes an Html.ImageGetter and an Html.TagHandler as arguments in addition to the text to parse.

Example

String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";

Then

Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());

Output

This is about me text that the user can put into their profile

Luedtke answered 23/11, 2015 at 12:11 Comment(1)
No extra utilities and aligns with Android Docs. +1Busload
L
5

Here is one more variant of how to replace all(HTML Tags | HTML Entities | Empty Space in HTML content)

content.replaceAll("(<.*?>)|(&.*?;)|([ ]{2,})", ""); where content is a String.

Lackluster answered 20/6, 2018 at 7:36 Comment(2)
I improved it a bit: .replaceAll("(<.*?>)|(&.*?;)", " ").replaceAll("\\s{2,}", " ") because those tags are often right next to text. After removing the tags, it collapses every run of two or more whitespace characters into one.Isiahisiahi
This answer would take cakeArquit
G
4

It sounds like you want to go from HTML to plain text.
If that is the case look at www.htmlparser.org. Here is an example that strips all the tags out from the html file found at a URL.
It makes use of org.htmlparser.beans.StringBean.

static public String getUrlContentsAsText(String url) {
    String content = "";
    StringBean stringBean = new StringBean();
    stringBean.setURL(url);
    content = stringBean.getStrings();
    return content;
}
Goodloe answered 18/1, 2009 at 2:16 Comment(0)
W
4

I know this is old, but I was just working on a project that required me to filter HTML and this worked fine:

noHTMLString.replaceAll("\\&.*?\\;", "");

instead of this:

html = html.replaceAll("&nbsp;","");
html = html.replaceAll("&amp;","");
Wulfila answered 7/6, 2011 at 14:13 Comment(0)
P
2

Here is another way to do it:

public static String removeHTML(String input) {
    int i = 0;
    String[] str = input.split("");

    String s = "";
    boolean inTag = false;

    for (i = input.indexOf("<"); i < input.indexOf(">"); i++) {
        inTag = true;
    }
    if (!inTag) {
        for (i = 0; i < str.length; i++) {
            s = s + str[i];
        }
    }
    return s;
}
Poultice answered 16/10, 2011 at 11:37 Comment(1)
Or you can just say, if(input.indexOf("<") > 0 || input.indexOf(">") > 0) return ""; else return input;Czar
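As the comment points out, the method above returns either the whole input or an empty string. If the intent was a regex-free, character-by-character scan, a sketch of that approach could look like the following. The class name is hypothetical, and like any tag-skipping scanner it treats every `<` as the start of a tag.

```java
// Hypothetical regex-free variant: copy everything outside '<' ... '>'.
final class TagStateMachineSketch {

    static String removeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length());
        boolean inTag = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<') {
                inTag = true;           // start skipping
            } else if (c == '>' && inTag) {
                inTag = false;          // tag closed, resume copying
            } else if (!inTag) {
                out.append(c);          // ordinary text, including a lone '>'
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeHtml("<b>bold</b> and plain"));
    }
}
```

Using a StringBuilder also avoids the repeated String concatenation in the original loop.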
S
2

One could also use Apache Tika for this purpose. By default it preserves whitespaces from the stripped html, which may be desired in certain situations:

InputStream htmlInputStream = ..
HtmlParser htmlParser = new HtmlParser();
HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata());
System.out.println(htmlContentHandler.getBodyText().trim());
Safar answered 4/9, 2012 at 8:42 Comment(1)
Note that the parse method is deprecated in favor of Parser.parse(InputStream, ContentHandler, Metadata, ParseContext).Trifocal
S
1

One way to retain new-line info with JSoup is to precede all new line tags with some dummy string, execute JSoup and replace dummy string with "\n".

String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
String NEW_LINE_MARK = "NEWLINESTART1234567890NEWLINEEND";
for (String tag: new String[]{"</p>","<br/>","</h1>","</h2>","</h3>","</h4>","</h5>","</h6>","</li>"}) {
    html = html.replace(tag, NEW_LINE_MARK+tag);
}

String text = Jsoup.parse(html).text();

text = text.replace(NEW_LINE_MARK + " ", "\n\n");
text = text.replace(NEW_LINE_MARK, "\n\n");
Sueannsuede answered 4/9, 2015 at 20:53 Comment(0)
L
1
classeString.replaceAll("\\<(/?[^\\>]+)\\>", "\\ ").replaceAll("\\s+", " ").trim() 
Loxodrome answered 26/1, 2018 at 12:27 Comment(1)
While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. Please also try not to crowd your code with explanatory comments, this reduces the readability of both the code and the explanations!Aniconic
I
1

Sometimes the HTML string comes from XML with escaped entities such as &lt;. When using Jsoup, we need to parse it and then clean it.

Document doc = Jsoup.parse(htmlstrl);
Whitelist wl = Whitelist.none();
String plain = Jsoup.clean(doc.text(), wl);

Using only Jsoup.parse(htmlstrl).text() does not remove the tags in that case.

Inhalant answered 3/9, 2020 at 9:3 Comment(2)
What is "Whitelist" here ?Overtake
Not working for "\r\n HDFC Bank <\/a>\r\n <\/div>\r\n <\/td>\r\n", I replaced Whitelist with SafelistOvertake
K
1

Try this for javascript:

const strippedString = htmlString.replace(/(<([^>]+)>)/gi, "");
console.log(strippedString);
Knave answered 10/9, 2020 at 14:36 Comment(0)
A
1

You can use this method to remove the HTML tags from the String,

public static String stripHtmlTags(String html) {

    return html.replaceAll("<.*?>", "");

}
Accroach answered 1/3, 2021 at 15:44 Comment(1)
Not sure if you saw the comment on the accepted answer (from 2010) that says - try this <p>Lorem ipsum 1 < 3 dolor sit amet</p> and see how well the regex works ..Potential
S
0

My 5 cents:

String[] temp = yourString.split("&amp;");
String tmp = "";
if (temp.length > 1) {

    for (int i = 0; i < temp.length; i++) {
        tmp += temp[i] + "&";
    }
    yourString = tmp.substring(0, tmp.length() - 1);
}
Spermatid answered 9/8, 2011 at 14:40 Comment(0)
C
0

To get formatted plain HTML text, you can do this:

String BR_ESCAPED = "&lt;br/&gt;";
Elements el=Jsoup.parse(html).select("body");
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED+BR_ESCAPED);
el.select("h1").append(BR_ESCAPED+BR_ESCAPED);
el.select("h2").append(BR_ESCAPED+BR_ESCAPED);
el.select("h3").append(BR_ESCAPED+BR_ESCAPED);
el.select("h4").append(BR_ESCAPED+BR_ESCAPED);
el.select("h5").append(BR_ESCAPED+BR_ESCAPED);
String nodeValue=el.text();
nodeValue=nodeValue.replaceAll(BR_ESCAPED, "<br/>");
nodeValue=nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");

To get formatted plain text, replace <br/> with \n and change the last line to:

nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "\n\n");
Cloutier answered 25/4, 2013 at 16:57 Comment(0)
B
0

I know it has been a while since this question was asked, but I found another solution. This is what worked for me:

Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
Source source = new Source(htmlAsString);
Matcher m = REMOVE_TAGS.matcher(source.getTextExtractor().toString());
String clearedHtml = m.replaceAll("");
Bulimia answered 25/5, 2020 at 11:14 Comment(0)
D
0

Worth noting that if you're trying to accomplish this in a Service Stack project, it's already a built-in string extension

using ServiceStack.Text;
// ...
"The <b>quick</b> brown <p> fox </p> jumps over the lazy dog".StripHtml();
Domestic answered 15/7, 2020 at 17:53 Comment(0)
D
0

I often find that I only need to strip out comments and script elements. This has worked reliably for me for 15 years and can easily be extended to handle any element name in HTML or XML:

// delete all comments
response = response.replaceAll("<!--[^>]*-->", "");
// delete all script elements
response = response.replaceAll("<(script|SCRIPT)[^+]*?>[^>]*?<(/script|SCRIPT)>", "");
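A possible variation on the two patterns above, offered as a sketch rather than a drop-in replacement: `[^>]*` stops at the first `>`, so a comment or script body that contains `>` is not matched. Using DOTALL with a reluctant `.*?` avoids that, and the element name can be passed in as a parameter. The class and method names are hypothetical, and nested elements of the same name are not handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper for removing comments and whole elements by name.
final class CommentAndElementStripSketch {
    // (?s) lets '.' span line breaks; '.*?' stops at the first "-->".
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");

    static String removeComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    // Removes <name ...>...</name> including its content, case-insensitively.
    static String removeElement(String html, String name) {
        String n = Pattern.quote(name);
        Pattern p = Pattern.compile(
                "(?is)<" + n + "\\b[^>]*>.*?</" + n + "\\s*>");
        return p.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "a<!-- x > y -->b<SCRIPT>if (a > b) {}</script>c";
        System.out.println(removeElement(removeComments(html), "script"));
    }
}
```

The same `removeElement` call can be reused for other names such as `style`.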
Deettadeeyn answered 23/8, 2020 at 21:14 Comment(0)
I
0

You can use this code to remove HTML tags including line breaks.

function remove_html_tags(html) {
    html = html.replace(/<div>/g, "").replace(/<\/div>/g, "<br>");
    html = html.replace(/<br>/g, "$br$");
    html = html.replace(/(?:\r\n|\r|\n)/g, '$br$');
    var tmp = document.createElement("DIV");
    tmp.innerHTML = html;
    html = tmp.textContent || tmp.innerText;
    html = html.replace(/\$br\$/g, "\n");
    return html;
}
Intrigue answered 6/9, 2021 at 11:0 Comment(1)
Please provide additional details in your answer. As it's currently written, it's hard to understand your solution.Immaterialize
A
-1

you can simply make a method with multiple replaceAll() like

String RemoveTag(String html){
   html = html.replaceAll("\\<.*?>","");
   html = html.replaceAll("&nbsp;","");
   html = html.replaceAll("&amp;","");
   // ... more replacements as needed ...
   return html;
}

Use this link for most common replacements you need: http://tunes.org/wiki/html_20special_20characters_20and_20symbols.html

It is simple but effective. I use this method first to remove the junk, but without the very first line, i.e. replaceAll("\<.*?>",""), and later I use specific keywords to search for indexes and then use the .substring(start, end) method to strip away unnecessary stuff. This is more robust, and you can pinpoint exactly what you need in the entire HTML page.

Amaras answered 17/11, 2010 at 1:44 Comment(1)
Two notes. First, this is suboptimal - for each replaceAll call, Java will attempt to compile the first argument as a regex and run through the entire string to apply that regex to the string, processing a few dozen KB for a regular HTML page every time. Second, it's advised not to use replaceAll to replace simple (non-regex) strings, but instead use replace() (which also replaces all, unlike the name suggests).Monitorial
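Both notes can be applied in a few lines. In this sketch the tag pattern is compiled once into a constant, and the literal entity replacements use `String.replace()`, which does not interpret its arguments as regex. The class name is hypothetical, and the entities are decoded to a space and `&` here rather than deleted; pass `""` instead to match the answer's behaviour.

```java
import java.util.regex.Pattern;

// Hypothetical helper showing a precompiled pattern plus literal replace().
final class PrecompiledStripSketch {
    // Compiled once and reused, instead of on every replaceAll() call.
    private static final Pattern TAG = Pattern.compile("<.*?>");

    static String removeTags(String html) {
        String s = TAG.matcher(html).replaceAll("");
        // replace() treats both arguments as literal text and replaces every occurrence.
        s = s.replace("&nbsp;", " ");
        s = s.replace("&amp;", "&");
        return s;
    }

    public static void main(String[] args) {
        System.out.println(removeTags("<p>A&nbsp;&amp;&nbsp;B</p>"));
    }
}
```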
Y
-1

Remove HTML tags from string. Sometimes we need to parse a string received in a response from the server, such as an HTTP response.

So we need to parse it.

Here I will show how to remove html tags from string.

    // sample text with tags

    string str = "<html><head>sdfkashf sdf</head><body>sdfasdf</body></html>";



    // regex which match tags

    System.Text.RegularExpressions.Regex rx = new System.Text.RegularExpressions.Regex("<[^>]*>");



    // replace all matches with empty string

    str = rx.Replace(str, "");



    //now str contains string without html tags
Yoicks answered 3/9, 2014 at 16:2 Comment(2)
Where do you get new System.Text.RegularExpressions.Regex(); from?Sand
@Sand this response applies to .NET, not Java like was requested in the questionPadegs
