How do I preserve line breaks when using jsoup to convert html to plain text?

I have the following code:

 import org.jsoup.Jsoup;

 public class NewClass {
     // Strips all tags; text() also normalizes whitespace, so line breaks are lost
     public String noTags(String str) {
         return Jsoup.parse(str).text();
     }

     public static void main(String[] args) {
         String strings = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
         "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";

         NewClass text = new NewClass();
         System.out.println(text.noTags(strings));
     }
 }

And I get this result:

hello world yo googlez

But I want the line break to be preserved:

hello world
yo googlez

I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.

If there's a <br> in the markup I parse, how can I get a line break in my resulting output?

Paulenepauletta answered 12/4, 2011 at 19:11 Comment(3)
Edit your text: there is no line break showing up in your question. In general, please read the preview of your question before posting it, to check that everything shows up right. - Mary
I asked the same question (without the jsoup requirement) but I still do not have a good solution: #2514207 - Maharaja
See @zeenosaur's answer. - Burleson
A
114

The real solution that preserves line breaks should look like this:

public static String br2nl(String html) {
    if(html==null)
        return html;
    Document document = Jsoup.parse(html);
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    String s = document.html().replaceAll("\\\\n", "\n");
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}

It satisfies the following requirements:

  1. if the original HTML contains a newline (\n), it is preserved
  2. if the original HTML contains br or p tags, they are translated to a newline (\n).
Adorable answered 26/10, 2013 at 2:57 Comment(6)
The answer by @MircoAttocchi works best for me. This solution leaves entities as they are, which is not good: "La porta &egrave; aperta" remains unchanged, whereas I want "La porta è aperta". - Chancey
br2nl is not the most helpful or accurate method name. - Kittrell
This is the best answer. But how about for (Element e : document.select("br")) e.after(new TextNode("\n", "")); which appends a real newline and not the sequence \n? See Node::after() and Elements::append() for the difference. The replaceAll() is not needed in this case. The same applies to p and other block elements. - Pyrometer
@user121196's answer should be the chosen answer. If you still have HTML entities after you clean the input HTML, apply StringEscapeUtils.unescapeHtml(...) from Apache Commons to the output of the Jsoup clean. - Ecchymosis
See github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/… for a comprehensive answer to this problem. - Revitalize
<p>Line one</p>Line 2 should NOT become "\nLine one Line 2"; newlines have to be inserted before AND after the relevant block tags. It is also missing MANY block tags such as <div> and <li>. - Turnbow
M
46

With

Jsoup.parse("A\nB").text();

the output is

"A B" 

and not

A

B

For this I'm using:

descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");
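
To see what the regex step does on its own, here is a minimal stdlib-only sketch. The class name BrPlaceholderDemo and the method markBreaks are hypothetical, and the jsoup call is deliberately left out so the snippet runs without the library. It also shows a failure mode of matching on raw text: the pattern matches a "<br>" that sits inside an attribute value.

```java
import java.util.regex.Matcher;

public class BrPlaceholderDemo {
    // Same pattern as in the answer: case-insensitive <br>, <br/>, <br class="x">, ...
    static final String BR_PATTERN = "(?i)<br[^>]*>";

    // Replaces every textual match of the pattern with the placeholder.
    // No HTML parsing is involved, so the match is purely character-based.
    static String markBreaks(String html, String placeholder) {
        return html.replaceAll(BR_PATTERN, Matcher.quoteReplacement(placeholder));
    }

    public static void main(String[] args) {
        // All three spellings of the tag are replaced.
        System.out.println(markBreaks("a<br>b<BR/>c<br />d", "br2n"));
        // The "<br>" inside the attribute value is replaced as well.
        System.out.println(markBreaks("<img alt=\"<br>\">", "br2n"));
    }
}
```

In the full approach, the marked string would then go through Jsoup.parse(...).text() before the placeholder is turned into "\n".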
Mcclees answered 17/5, 2011 at 13:26 Comment(4)
Indeed this is an easy palliative, but IMHO this should be fully handled by the jsoup library itself (which currently has a few disturbing behaviors like this one; otherwise it's a great library!). - Foreworn
Doesn't jsoup give you a DOM? Why not just replace all <br> elements with text nodes containing newlines and then call .text(), instead of doing a regex transform that will cause incorrect output for some strings like <div title=<br>'not an attribute'></div>? - Shelter
Nice, but where does that "descrizione" come from? - Dalessio
"descrizione" represents the variable the plain text gets assigned to. - Kirstinkirstyn
T
45
Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

We're using this method here:

public static String clean(String bodyHtml,
                       String baseUri,
                       Whitelist whitelist,
                       Document.OutputSettings outputSettings)

By passing it Whitelist.none() we make sure that all HTML is removed.

By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.

Tentation answered 23/4, 2013 at 16:46 Comment(4)
This should be the only correct answer. All others assume that only br tags produce new lines. What about other block elements in HTML such as div, p, ul etc.? All of them introduce new lines too. - Freemon
With this solution, the HTML "<html><body><div>line 1</div><div>line 2</div><div>line 3</div></body></html>" produced the output "line 1line 2line 3", with no new lines. - Schermerhorn
This doesn't work for me; <br>s aren't creating line breaks. - Kaceykachina
Thanks! Also, Whitelist has been replaced with the Safelist class. - Whenas
F
37

As of jsoup 1.11.2, we can use Element.wholeText().

String cleanString = Jsoup.parse(htmlString).wholeText();

user121196's answer still works, but wholeText() keeps the newlines and spaces present in the original text.

Flunkey answered 17/5, 2018 at 14:4 Comment(1)
Today, in 2023, this is still working :-) Thanks. - Alcaide
I
24

Try this by using jsoup:

public static String cleanPreserveLineBreaks(String bodyHtml) {

    // get pretty printed html with preserved br and p tags
    String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
    // get plain text with preserved line breaks by disabled prettyPrint
    return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}
Incarcerate answered 24/6, 2013 at 15:42 Comment(2)
Nice, it works for me with a small change: new Document.OutputSettings().prettyPrint(true) - Terreverte
This solution leaves "&nbsp;" as text instead of parsing it into a space. - Instructive
M
11

For more complex HTML none of the above solutions worked quite right; I was able to successfully do the conversion while preserving line breaks with:

Document document = Jsoup.parse(myHtml);
String text = new HtmlToPlainText().getPlainText(document);

(version 1.10.3)

Mountainous answered 21/9, 2017 at 12:49 Comment(1)
Yes, this does a good job. - Peers
P
7

You can traverse a given element:

public String convertNodeToText(Element element)
{
    final StringBuilder buffer = new StringBuilder();

    new NodeTraversor(new NodeVisitor() {
        boolean isNewline = true;

        @Override
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                String text = textNode.text().replace('\u00A0', ' ').trim();                    
                if(!text.isEmpty())
                {                        
                    buffer.append(text);
                    isNewline = false;
                }
            } else if (node instanceof Element) {
                Element element = (Element) node;
                if (!isNewline)
                {
                    if((element.isBlock() || element.tagName().equals("br")))
                    {
                        buffer.append("\n");
                        isNewline = true;
                    }
                }
            }                
        }

        @Override
        public void tail(Node node, int depth) {                
        }                        
    }).traverse(element);        

    return buffer.toString();               
}

And for your code:

String result = convertNodeToText(Jsoup.parse(html));
Pastorate answered 1/8, 2013 at 8:53 Comment(2)
I think you should test isBlock in tail(node, depth) instead, and append \n when leaving the block rather than when entering it. I'm doing that (i.e. using tail) and it works fine. However, if I use head like you do, then <p>line one<p>line two ends up as a single line. - Eno
new NodeTraversor(nodeVisitor).traverse(element); no longer works on newer jsoup versions (currently 1.14.3). All traverse methods in NodeTraversor are now static, so they should be called like NodeTraversor.traverse(nodeVisitor, element);. - Wheelsman
R
5

Based on the other answers and the comments on this question, it seems that most people coming here are really looking for a general solution that will provide a nicely formatted plain text representation of an HTML document. I know I was.

Fortunately, jsoup already provides a pretty comprehensive example of how to achieve this: HtmlToPlainText.java

The example FormattingVisitor can easily be tweaked to your preference and deals with most block elements and line wrapping.

To avoid link rot, here is Jonathan Hedley's solution in full:

package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

import java.io.IOException;

/**
 * HTML to plain-text. This example program demonstrates the use of jsoup to convert HTML input to lightly-formatted
 * plain-text. That is divergent from the general goal of jsoup's .text() methods, which is to get clean data from a
 * scrape.
 * <p>
 * Note that this is a fairly simplistic formatter -- for real world use you'll want to embrace and extend.
 * </p>
 * <p>
 * To invoke from the command line, assuming you've downloaded the jsoup jar to your current directory:</p>
 * <p><code>java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]</code></p>
 * where <i>url</i> is the URL to fetch, and <i>selector</i> is an optional CSS selector.
 * 
 * @author Jonathan Hedley
 */
public class HtmlToPlainText {
    private static final String userAgent = "Mozilla/5.0 (jsoup)";
    private static final int timeout = 5 * 1000;

    public static void main(String... args) throws IOException {
        Validate.isTrue(args.length == 1 || args.length == 2, "usage: java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]");
        final String url = args[0];
        final String selector = args.length == 2 ? args[1] : null;

        // fetch the specified URL and parse to a HTML DOM
        Document doc = Jsoup.connect(url).userAgent(userAgent).timeout(timeout).get();

        HtmlToPlainText formatter = new HtmlToPlainText();

        if (selector != null) {
            Elements elements = doc.select(selector); // get each element that matches the CSS selector
            for (Element element : elements) {
                String plainText = formatter.getPlainText(element); // format that element to plain text
                System.out.println(plainText);
            }
        } else { // format the whole doc
            String plainText = formatter.getPlainText(doc);
            System.out.println(plainText);
        }
    }

    /**
     * Format an Element to plain-text
     * @param element the root element to format
     * @return formatted text
     */
    public String getPlainText(Element element) {
        FormattingVisitor formatter = new FormattingVisitor();
        NodeTraversor traversor = new NodeTraversor(formatter);
        traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node

        return formatter.toString();
    }

    // the formatting rules, implemented in a depth-first DOM traverse
    private class FormattingVisitor implements NodeVisitor {
        private static final int maxWidth = 80;
        private int width = 0;
        private StringBuilder accum = new StringBuilder(); // holds the accumulated text

        // hit when the node is first seen
        public void head(Node node, int depth) {
            String name = node.nodeName();
            if (node instanceof TextNode)
                append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
            else if (name.equals("li"))
                append("\n * ");
            else if (name.equals("dt"))
                append("  ");
            else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
                append("\n");
        }

        // hit when all of the node's children (if any) have been visited
        public void tail(Node node, int depth) {
            String name = node.nodeName();
            if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
                append("\n");
            else if (name.equals("a"))
                append(String.format(" <%s>", node.absUrl("href")));
        }

        // appends text to the string builder with a simple word wrap method
        private void append(String text) {
            if (text.startsWith("\n"))
                width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
            if (text.equals(" ") &&
                    (accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
                return; // don't accumulate long runs of empty spaces

            if (text.length() + width > maxWidth) { // won't fit, needs to wrap
                String words[] = text.split("\\s+");
                for (int i = 0; i < words.length; i++) {
                    String word = words[i];
                    boolean last = i == words.length - 1;
                    if (!last) // insert a space if not the last word
                        word = word + " ";
                    if (word.length() + width > maxWidth) { // wrap and reset counter
                        accum.append("\n").append(word);
                        width = word.length();
                    } else {
                        accum.append(word);
                        width += word.length();
                    }
                }
            } else { // fits as is, without need to wrap text
                accum.append(text);
                width += text.length();
            }
        }

        @Override
        public String toString() {
            return accum.toString();
        }
    }
}
Revitalize answered 19/5, 2017 at 8:21 Comment(1)
One advantage this has over the simple Element.wholeText is that it extracts href links. - Hydrography
P
4
text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = text.replaceAll("br2n", "\n");

works if the html itself doesn't contain "br2n".

So,

text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "<pre>\n</pre>")).text();

is more reliable and simpler.

Playwright answered 24/7, 2014 at 4:53 Comment(0)
S
3

Try this:

public String noTags(String str){
    Document d = Jsoup.parse(str);
    TextNode tn = new TextNode(d.body().html(), "");
    return tn.getWholeText();
}
Schweiz answered 12/4, 2011 at 20:8 Comment(2)
This returns <p><b>hello world</b></p> <p><br /><b>yo</b> <a href="google.com">googlez</a></p>, but I need "hello world yo googlez" (without HTML tags). - Paulenepauletta
This answer doesn't return plain text; it returns HTML with newlines inserted. - Eno
N
3

Use textNodes() to get a list of the text nodes, then concatenate them with \n as the separator. Here's some Scala code I use for this; a Java port should be easy:

val rawTxt = doc.body().getElementsByTag("div").first.textNodes()
                    .asScala.mkString("<br />\n")
Naturalism answered 18/9, 2013 at 17:2 Comment(0)
D
3

This is my version of translating HTML to text (actually a modified version of user121196's answer).

It doesn't just preserve line breaks; it also formats the text and removes excessive line breaks and HTML escape symbols, so you get a much better result from your HTML (in my case I receive it from mail).

It was originally written in Scala, but you can easily change it to Java:

def html2text( rawHtml : String ) : String = {

    val htmlDoc = Jsoup.parseBodyFragment( rawHtml, "/" )
    htmlDoc.select("br").append("\\nl")
    htmlDoc.select("div").prepend("\\nl").append("\\nl")
    htmlDoc.select("p").prepend("\\nl\\nl").append("\\nl\\nl")

    org.jsoup.parser.Parser.unescapeEntities(
        Jsoup.clean(
          htmlDoc.html(),
          "",
          Whitelist.none(),
          new org.jsoup.nodes.Document.OutputSettings().prettyPrint(true)
        ),false
    ).
    replaceAll("\\\\nl", "\n").
    replaceAll("\r","").
    replaceAll("\n\\s+\n","\n").
    replaceAll("\n\n+","\n\n").     
    trim()      
}
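
The whitespace cleanup at the end of that chain needs no jsoup at all, so it can be ported and tried in isolation. Below is a minimal Java sketch of the last four steps only (everything after the "\\nl" marker substitution); the class name NewlineCleanup and the method tidy are hypothetical. One thing it makes visible: because \s also matches newlines, a run of three or more consecutive newlines is collapsed to a single newline by the second step, while exactly two newlines survive as a blank line.

```java
public class NewlineCleanup {
    // Applies the same four steps as the tail of the Scala chain, in the same order.
    static String tidy(String text) {
        return text
                .replaceAll("\r", "")          // drop carriage returns
                .replaceAll("\n\\s+\n", "\n")  // newline + whitespace run + newline -> one newline
                .replaceAll("\n\n+", "\n\n")   // cap any remaining newline run at two
                .trim();                       // strip leading and trailing whitespace
    }

    public static void main(String[] args) {
        // Two newlines between paragraphs are kept as a blank line.
        System.out.println(tidy("p1\n\np2"));
        // Whitespace-only lines and longer newline runs collapse to a single newline.
        System.out.println(tidy("a\r\n\r\n\r\n\r\nb\n \nc  "));
    }
}
```

If a blank line between paragraphs should survive longer newline runs too, the second pattern would need to be restricted to spaces and tabs; that is a design choice the original chain does not make.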
Dordrecht answered 5/6, 2016 at 12:59 Comment(1)
You need to prepend a new line to <div> tags as well. Otherwise, if a div follows <a> or <span> tags, it will not be on a new line. - Instructive
D
3

Try this by using jsoup:

    doc.outputSettings(new OutputSettings().prettyPrint(false));

    //select all <br> tags and append \n after that
    doc.select("br").after("\\n");

    //select all <p> tags and prepend \n before that
    doc.select("p").before("\\n");

    //get the HTML from the document, and retaining original new lines
    String str = doc.html().replaceAll("\\\\n", "\n");
Dauntless answered 8/9, 2017 at 19:38 Comment(0)
M
1
/**
 * Recursive method to replace HTML br tags with Java \n. The recursion ensures that the placeholder never already exists in the text being replaced.
 * @param html the HTML to convert
 * @param linebreakerString the temporary placeholder to use for line breaks
 * @return the text of the HTML as a String, with Java newlines instead of br
 */
public static String replaceBrWithNewLine(String html, String linebreakerString){
    String result = "";
    if(html.contains(linebreakerString)){
        // placeholder already occurs in the input: retry with a longer one
        result = replaceBrWithNewLine(html, linebreakerString+"1");
    } else {
        result = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", linebreakerString)).text(); // replace any HTML line breaks with the placeholder
        result = result.replace(linebreakerString, "\n"); // literal replacement, so the placeholder is not treated as a regex
    }
    return result;
}

Call it with the HTML in question (containing the br tags), along with whatever string you wish to use as the temporary newline placeholder. For example:

replaceBrWithNewLine(element.html(), "br2n")

The recursion ensures that the string you use as the newline placeholder is never actually in the source HTML, as it keeps appending "1" until the placeholder string is not found in the HTML. It won't have the formatting issue that the Jsoup.clean methods seem to encounter with special characters.
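
The same guarantee can be had with a loop instead of recursion. Here is a minimal stdlib-only sketch; the class name PlaceholderDemo and the helpers uniquePlaceholder and restoreNewlines are hypothetical, and the jsoup text extraction is left out so the snippet runs without the library.

```java
import java.util.regex.Matcher;

public class PlaceholderDemo {
    // Extends the candidate with "1" until it no longer occurs in the input,
    // mirroring the recursive version as a plain loop.
    static String uniquePlaceholder(String html, String candidate) {
        String placeholder = candidate;
        while (html.contains(placeholder)) {
            placeholder = placeholder + "1";
        }
        return placeholder;
    }

    // Swaps the placeholder for a real newline using literal (non-regex) replacement,
    // so characters such as '.' or '$' in the placeholder carry no special meaning.
    static String restoreNewlines(String text, String placeholder) {
        return text.replace(placeholder, "\n");
    }

    public static void main(String[] args) {
        String html = "uses br2n already<br>next";
        String placeholder = uniquePlaceholder(html, "br2n");
        // Stand-in for the jsoup step: only the <br> substitution is shown here.
        String marked = html.replaceAll("(?i)<br[^>]*>", Matcher.quoteReplacement(placeholder));
        System.out.println(restoreNewlines(marked, placeholder));
    }
}
```

Because the input already contains "br2n", the helper settles on a longer placeholder, and the existing "br2n" text is left untouched when the newlines are restored.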

Maidstone answered 25/1, 2014 at 18:48 Comment(2)
Good one, but you don't need recursion; just add this line: while(dirtyHTML.contains(linebreakerString)) linebreakerString = linebreakerString + "1"; - Spatiotemporal
Ah, yes. Completely true. Guess my mind got caught up in, for once, actually being able to use recursion :) - Maidstone
H
1

Based on user121196's and Green Beret's answers with the selects and <pre>s, the only solution that works for me is:

org.jsoup.nodes.Element elementWithHtml = ....
elementWithHtml.select("br").append("<pre>\n</pre>");
elementWithHtml.select("p").prepend("<pre>\n\n</pre>");
elementWithHtml.text();
Hans answered 31/5, 2016 at 18:14 Comment(0)