Jsoup - extracting text

Asked 16/4, 2012 at 16:19 Answered 21/7, 2015 at 18:41

Solved java iteration jsoup text-extraction

I need to extract text from a node like this:

<div>
    Some text <b>with tags</b> might go here.
    <p>Also there are paragraphs</p>
    More text can go without paragraphs<br/>
</div>

And I need to build:

Some text <b>with tags</b> might go here.
Also there are paragraphs
More text can go without paragraphs

Element.text() returns all of the div's text, including the text of its child elements, with the tags stripped. Element.ownText() returns only the text that is not inside child elements. Neither is what I need, and iterating over the element's children skips the text nodes.

Is there a way to iterate over the contents of an element so that text nodes are included as well? For example:

Text node - Some text
Node <b> - with tags
Text node - might go here.
Node <p> - Also there are paragraphs
Text node - More text can go without paragraphs
Node <br> - <empty>

Aleph answered 16/4, 2012 at 16:19 Comment(0)

Element.children() returns an Elements object, which is a list of Element objects only. Looking at the parent class, Node, you'll see methods that give you access to arbitrary nodes, not just Elements, such as Node.childNodes().

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public static void main(String[] args) {
    String str = "<div>" +
            "    Some text <b>with tags</b> might go here." +
            "    <p>Also there are paragraphs</p>" +
            "    More text can go without paragraphs<br/>" +
            "</div>";

    Document doc = Jsoup.parse(str);
    Element div = doc.select("div").first();
    int i = 0;

    // childNodes() returns every direct child: text nodes as well as elements
    for (Node node : div.childNodes()) {
        i++;
        System.out.println(String.format("%d %s %s",
                i,
                node.getClass().getSimpleName(),
                node.toString()));
    }
}

Result:

1 TextNode 
 Some text 
2 Element <b>with tags</b>
3 TextNode  might go here. 
4 Element <p>Also there are paragraphs</p>
5 TextNode  More text can go without paragraphs
6 Element <br/>
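
To get from that node list to the three lines the question asks for, one option is to append text and inline elements to a running line and start a new line at each <p> or <br>. The sketch below is only an illustration and rests on several assumptions: it uses the JDK's built-in org.w3c.dom parser rather than jsoup so that it runs without extra dependencies, it requires the fragment to be well-formed XML, and the names LineExtractor, extractLines and flush are made up for this example. With jsoup, the same loop would run over div.childNodes(), checking for TextNode and Element instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of jsoup: groups the direct children of a
// well-formed fragment into output lines.
class LineExtractor {

    static List<String> extractLines(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("fragment is not well-formed XML", e);
        }

        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        NodeList children = root.getChildNodes();

        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() == Node.TEXT_NODE) {
                // Plain text continues the current line
                current.append(node.getTextContent());
            } else if (node.getNodeType() == Node.ELEMENT_NODE) {
                String tag = node.getNodeName();
                if (tag.equals("p")) {
                    // A paragraph ends the current line and forms a line of its own
                    flush(current, lines);
                    current.append(node.getTextContent());
                    flush(current, lines);
                } else if (tag.equals("br")) {
                    // A line break only ends the current line
                    flush(current, lines);
                } else {
                    // Assumption: any other tag is inline and is kept as markup.
                    // Attributes and nested tags are dropped in this simplified version.
                    current.append('<').append(tag).append('>')
                            .append(node.getTextContent())
                            .append("</").append(tag).append('>');
                }
            }
        }
        flush(current, lines);
        return lines;
    }

    // Adds the pending line (trimmed, whitespace collapsed) if it is not empty
    static void flush(StringBuilder current, List<String> lines) {
        String line = current.toString().trim().replaceAll("\\s+", " ");
        if (!line.isEmpty()) {
            lines.add(line);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        String xml = "<div>"
                + "    Some text <b>with tags</b> might go here."
                + "    <p>Also there are paragraphs</p>"
                + "    More text can go without paragraphs<br/>"
                + "</div>";
        for (String line : extractLines(xml)) {
            System.out.println(line);
        }
    }
}
```

Run on the fragment from the question, main prints the three requested lines, with the <b> markup kept in the first one. Which tags count as line-breaking is hard-coded here; real pages would need a longer list of block-level tags.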

Intermolecular answered 16/4, 2012 at 20:45 Comment(0)

// Visit body and every element below it, printing each element's own text nodes
for (Element el : doc.select("body").select("*")) {
    for (TextNode node : el.textNodes()) {
        System.out.println(node.text());
    }
}

Occasion answered 13/8, 2013 at 21:10 Comment(1)

I guess you are missing a system.out.println there in the loop, but this is the example that works fine for extracting all text nodes recursively. – Locution 14/11, 2018 at 7:58

Assuming you want text only (no tags), my solution is below.
Output is:
Some text with tags might go here. Also there are paragraphs. More text can go without paragraphs

public static void main(String[] args) throws IOException {
    String str = 
                "<div>"  
            +   "    Some text <b>with tags</b> might go here."
            +   "    <p>Also there are paragraphs.</p>"
            +   "    More text can go without paragraphs<br/>" 
            +   "</div>";

    Document doc = Jsoup.parse(str);
    Element div = doc.select("div").first();
    StringBuilder builder = new StringBuilder();
    stripTags(builder, div.childNodes());
    System.out.println("Text without tags: " + builder.toString());
}

/**
 * Strip tags from a List of type <code>Node</code>
 * @param builder StringBuilder : input and output
 * @param nodesList List of type <code>Node</code>
 */
public static void stripTags(StringBuilder builder, List<Node> nodesList) {
    for (Node node : nodesList) {
        if (node instanceof TextNode) {
            // text() returns the decoded text; toString() would return escaped HTML
            builder.append(((TextNode) node).text());
        } else {
            // Not a text node: recurse into its children
            stripTags(builder, node.childNodes());
        }
    }
}

Staminody answered 16/12, 2014 at 20:21 Comment(0)

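You can use TextNode for this purpose. Note that textNodes() returns only the element's direct child text nodes, not text nested inside child elements:

// Assumes the page has an element with id "content"
List<TextNode> bodyTextNodes = doc.getElementById("content").textNodes();
StringBuilder text = new StringBuilder();
for (TextNode txNode : bodyTextNodes) {
    text.append(txNode.text());
}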

Mutt answered 21/7, 2015 at 18:41 Comment(0)
