How can I parse a HTML string in Java?

O

7

17

Given the string "<table><tr><td>Hello World!</td></tr></table>", what is the (easiest) way to get a DOM Element representing it?

Oyler answered 30/9, 2009 at 13:0 Comment(0)

O

3

I found this somewhere (don't remember where):

 public static DocumentFragment parseXml(Document doc, String fragment)
 {
    // Wrap the fragment in an arbitrary element.
    fragment = "<fragment>"+fragment+"</fragment>";
    try
    {
        // Create a DOM builder and parse the fragment.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document d = factory.newDocumentBuilder().parse(
                new InputSource(new StringReader(fragment)));

        // Import the nodes of the new document into doc so that they
        // will be compatible with doc.
        Node node = doc.importNode(d.getDocumentElement(), true);

        // Create the document fragment node to hold the new nodes.
        DocumentFragment docfrag = doc.createDocumentFragment();

        // Move the nodes into the fragment.
        while (node.hasChildNodes())
        {
            docfrag.appendChild(node.removeChild(node.getFirstChild()));
        }
        // Return the fragment.
        return docfrag;
    }
    catch (SAXException e)
    {
        // A parsing error occurred; the XML input is not valid.
    }
    catch (ParserConfigurationException e)
    {
    }
    catch (IOException e)
    {
    }
    return null;
}

Oyler answered 2/10, 2009 at 12:28 Comment(0)

S

13

If you have a string which contains HTML you can use Jsoup library like this to get HTML elements:

String htmlTable= "<table><tr><td>Hello World!</td></tr></table>";
Document doc = Jsoup.parse(htmlTable);

// then use something like this to get your element:
Elements tds = doc.getElementsByTag("td");

// tds will contain this one element: <td>Hello World!</td>

Good luck!

Stubbed answered 8/4, 2015 at 19:39 Comment(0)

H

10

Here's a way:

import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class HtmlParseDemo {
   public static void main(String [] args) throws Exception {
       Reader reader = new StringReader("<table><tr><td>Hello</td><td>World!</td></tr></table>");
       HTMLEditorKit.Parser parser = new ParserDelegator();
       parser.parse(reader, new HTMLTableParser(), true);
       reader.close();
   }
}

class HTMLTableParser extends HTMLEditorKit.ParserCallback {

    private boolean encounteredATableRow = false;

    public void handleText(char[] data, int pos) {
        if(encounteredATableRow) System.out.println(new String(data));
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if(t == HTML.Tag.TR) encounteredATableRow = true;
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        if(t == HTML.Tag.TR) encounteredATableRow = false;
    }
}

Harriet answered 30/9, 2009 at 13:10 Comment(4)

What if I want to put all the data pieces into an array in the outer class, rather than print them out? – Brain 17/6, 2013 at 18:21

@Imray, go right ahead, you have my permission to put them in some sort of collection instead of printing them :) – Harriet 17/6, 2013 at 18:34

I put them in a collection inside the HTMLTableParser class, and then created a getter method to get them. Is that the best way to do it? – Brain 17/6, 2013 at 19:58

@BartKiers how is it related to topic question?? The question is "to get a DOM Element representing it", not to catch SAX events! – Jenny 22/1, 2014 at 10:56

H

5

you could use HTML Parser, which a Java library used to parse HTML in either a linear or nested fashion. It is an open source tool and can be found on SourceForge

Hesione answered 30/9, 2009 at 13:3 Comment(0)

S

3

You could use Swing:

How do you make use of the HTML-processing capabilities that are built into Java? You may not know that Swing contains all the classes necessary to parse HTML. Jeff Heaton shows you how.

Serinaserine answered 30/9, 2009 at 13:2 Comment(0)

P

3

I've used Jericho HTML Parser it's OSS, detects(forgives) badly formatted tags and is lightweight

Pettway answered 30/9, 2009 at 13:10 Comment(0)

O

3

I found this somewhere (don't remember where):

 public static DocumentFragment parseXml(Document doc, String fragment)
 {
    // Wrap the fragment in an arbitrary element.
    fragment = "<fragment>"+fragment+"</fragment>";
    try
    {
        // Create a DOM builder and parse the fragment.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document d = factory.newDocumentBuilder().parse(
                new InputSource(new StringReader(fragment)));

        // Import the nodes of the new document into doc so that they
        // will be compatible with doc.
        Node node = doc.importNode(d.getDocumentElement(), true);

        // Create the document fragment node to hold the new nodes.
        DocumentFragment docfrag = doc.createDocumentFragment();

        // Move the nodes into the fragment.
        while (node.hasChildNodes())
        {
            docfrag.appendChild(node.removeChild(node.getFirstChild()));
        }
        // Return the fragment.
        return docfrag;
    }
    catch (SAXException e)
    {
        // A parsing error occurred; the XML input is not valid.
    }
    catch (ParserConfigurationException e)
    {
    }
    catch (IOException e)
    {
    }
    return null;
}

Oyler answered 2/10, 2009 at 12:28 Comment(0)

T

0

One can use some of the javax.swing.text.html utility classes for parsing HTML.

import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
//...
try {
    String htmlString = "<html><head><title>Example Title</title></head><body>Some text...</body></html>";
    HTMLEditorKit htmlEditKit = new HTMLEditorKit();
    HTMLDocument htmlDocument = (HTMLDocument) htmlEditKit.createDefaultDocument();
    HTMLEditorKit.Parser parser = new ParserDelegator();
    parser.parse(new StringReader(htmlString),
            htmlDocument.getReader(0), true);
    // Use HTMLDocument here
    System.out.println(htmlDocument.getProperty("title")); // Example Title
} catch(IOException e){
    //Handle
    e.printStackTrace();
}

See:

Tiffie answered 13/4, 2021 at 20:2 Comment(0)

Recommended topics

Hot tags