How to get node contents from JDOM

7

7

I'm writing an application in Java using JDOM (import org.jdom.*;).

My XML is valid, but it sometimes contains HTML tags. For example, something like this:

  <program-title>Anatomy &amp; Physiology</program-title>
  <overview>
       <content>
              For more info click <a href="page.html">here</a>
              <p>Learn more about the human body.  Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>
       </content>
  </overview>
  <key-information>
     <category>Health &amp; Human Services</category>

So my problem is with the <p> tags inside the overview.content node.

I was hoping that this code would work :

        Element overview = sds.getChild("overview");
        Element content = overview.getChild("content");

        System.out.println(content.getText());

but it returns blank.

How do I return all the text (nested tags and all) from the overview.content node?

Thanks

Zwolle answered 27/10, 2011 at 0:23 Comment(2)
Hi, how can I flatten the content node out recursively when the text is mixed in with other nodes? For example, a hyperlink sits in the middle of a sentence. I've added a bounty for some help.Zwolle
I need to get all of the HTML inside the content tag, including <a> links and ordered lists. ThanksZwolle
F
16

content.getText() returns only the text held directly under the element, so it is only useful for leaf elements that contain nothing but text.

The trick is to use org.jdom.output.XMLOutputter (here with the compact format, which normalizes whitespace):

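import java.io.StringWriter;

import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jdom.output.Format;
import org.jdom.output.XMLOutputter;

public class ContentDump {
    public static void main(String[] args) throws Exception {
        // Parse the XML file into a JDOM document
        SAXBuilder builder = new SAXBuilder();
        String xmlFileName = "a.xml";
        Document doc = builder.build(xmlFileName);

        Element root = doc.getRootElement();
        Element overview = root.getChild("overview");
        Element content = overview.getChild("content");

        XMLOutputter outp = new XMLOutputter();

        // Pick one format; the alternatives change only whitespace handling
        outp.setFormat(Format.getCompactFormat());
        //outp.setFormat(Format.getRawFormat());
        //outp.setFormat(Format.getPrettyFormat());
        //outp.getFormat().setTextMode(Format.TextMode.PRESERVE);

        // Serialize the children of <content> (text and elements), not <content> itself
        StringWriter sw = new StringWriter();
        outp.output(content.getContent(), sw);
        System.out.println(sw.toString());
    }
}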

Output

For more info click<a href="page.html">here</a><p>Learn more about the human body. Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>

Do explore the other formatting options and adapt the code above to your needs. From the Javadoc of org.jdom.output.Format:

"Class to encapsulate XMLOutputter format options. Typical users can use the standard format configurations obtained by getRawFormat() (no whitespace changes), getPrettyFormat() (whitespace beautification), and getCompactFormat() (whitespace normalization)."

Fade answered 13/1, 2012 at 16:42 Comment(0)
T
3

You could try the getValue() method for the closest approximation, but it concatenates all the text within the element and its descendants. It won't give you the <p> tag in any form: if that tag appears in your XML as you've shown, it is part of the XML markup. It would need to be escaped as &lt;p&gt; or wrapped in a CDATA section to be treated as text.

Alternatively, if you know which elements may appear in your XML, you could apply an XSLT transformation that turns everything that isn't intended as markup into plain text.
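
To make the XSLT idea concrete, here is a minimal sketch using only the JDK's built-in javax.xml.transform API rather than JDOM. The stylesheet, the class name ContentAsText and the method contentAsText are made up for illustration. It assumes the markup you care about sits under overview/content, and it writes child elements back out as literal tags with the text output method:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ContentAsText {
    // Hypothetical stylesheet: emits everything under overview/content as plain text,
    // re-creating each child element as a literal start tag, its content, and an end tag.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:apply-templates select='//overview/content/node()' mode='raw'/>"
      + "</xsl:template>"
      + "<xsl:template match='*' mode='raw'>"
      + "<xsl:text>&lt;</xsl:text><xsl:value-of select='name()'/>"
      + "<xsl:for-each select='@*'>"
      + "<xsl:text> </xsl:text><xsl:value-of select='name()'/>"
      + "<xsl:text>=\"</xsl:text><xsl:value-of select='.'/><xsl:text>\"</xsl:text>"
      + "</xsl:for-each>"
      + "<xsl:text>&gt;</xsl:text>"
      + "<xsl:apply-templates select='node()' mode='raw'/>"
      + "<xsl:text>&lt;/</xsl:text><xsl:value-of select='name()'/><xsl:text>&gt;</xsl:text>"
      + "</xsl:template>"
      + "<xsl:template match='text()' mode='raw'><xsl:value-of select='.'/></xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet over the given XML string and returns the text result.
    static String contentAsText(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><overview><content>For more info click "
                   + "<a href=\"page.html\">here</a></content></overview></root>";
        // prints: For more info click <a href="page.html">here</a>
        System.out.println(contentAsText(xml));
    }
}
```

Note that the text output method does no escaping, so an &amp; in the source comes out as a bare &. If you need the result to stay well-formed, serializing the nodes with an XML outputter is probably the safer route.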

Trinitrophenol answered 27/10, 2011 at 0:30 Comment(1)
Perfect answer for those who don't need the element names in mixed content. Thank you!Jany
A
3

Well, maybe that's what you need:

import java.io.StringReader;
import java.util.List;

import org.custommonkey.xmlunit.XMLTestCase;
import org.custommonkey.xmlunit.XMLUnit;
import org.jdom.Content;
import org.jdom.Document;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;
import org.testng.annotations.Test;
import org.xml.sax.InputSource;

public class HowToGetNodeContentsJDOM extends XMLTestCase
{
    private static final String XML = "<root>\n" + 
            "  <program-title>Anatomy &amp; Physiology</program-title>\n" + 
            "  <overview>\n" + 
            "       <content>\n" + 
            "              For more info click <a href=\"page.html\">here</a>\n" + 
            "              <p>Learn more about the human body.  Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>\n" + 
            "       </content>\n" + 
            "  </overview>\n" + 
            "  <key-information>\n" + 
            "     <category>Health &amp; Human Services</category>\n" + 
            "  </key-information>\n" + 
            "</root>";
    private static final String EXPECTED = "For more info click <a href=\"page.html\">here</a>\n" + 
            "<p>Learn more about the human body.  Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>";

    @Test
    public void test() throws Exception
    {
        XMLUnit.setIgnoreWhitespace(true);
        Document document = new SAXBuilder().build(new InputSource(new StringReader(XML)));
        List<Content> content = document.getRootElement().getChild("overview").getChild("content").getContent();
        String out = new XMLOutputter().outputString(content);
        assertXMLEqual("<root>" + EXPECTED + "</root>", "<root>" + out + "</root>");
    }
}

Output:

PASSED: test on instance null(HowToGetNodeContentsJDOM)

===============================================
    Default test
    Tests run: 1, Failures: 0, Skips: 0
===============================================

I am using JDom with generics: http://www.junlu.com/list/25/883674.html

Edit: Actually that's not that much different from Prashant Bhate's answer. Maybe you need to tell us what you are missing...

Appraisal answered 16/1, 2012 at 23:20 Comment(0)
V
1

If you're also generating the XML file, you should be able to wrap your HTML data in a <![CDATA[ ... ]]> section so that the XML parser doesn't parse it as markup.
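
A small sketch of the difference this makes, using the JDK's built-in DOM parser instead of JDOM so it runs without extra dependencies. The class name CdataDemo and the method contentText are hypothetical. As far as I know, JDOM's getText() likewise includes CDATA sections, so the same idea should carry over:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class CdataDemo {
    // Returns the concatenated text of the first <content> element in the given XML.
    static String contentText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element content = (Element) doc.getElementsByTagName("content").item(0);
            return content.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String wrapped = "<content><![CDATA[click <a href=\"page.html\">here</a>]]></content>";
        String parsed  = "<content>click <a href=\"page.html\">here</a></content>";
        // Inside CDATA the tags survive as text: click <a href="page.html">here</a>
        System.out.println(contentText(wrapped));
        // Without CDATA the <a> element is markup, so only its text remains: click here
        System.out.println(contentText(parsed));
    }
}
```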

Vagina answered 18/1, 2012 at 2:56 Comment(1)
No, unfortunately I don't generate the XML, I just have to consume it.Zwolle
S
0

The problem is that the <content> node doesn't have a text child; it has a <p> child that happens to contain text.

Try this:

Element overview = sds.getChild("overview");
Element content = overview.getChild("content");
Element p = content.getChild("p");
System.out.println(p.getText());

If you want all the immediate child elements, call getChildren() on the parent (for example content.getChildren()); getContent() also returns the text nodes. If you want ALL the descendant nodes, you'll have to call it recursively.

Slumberous answered 27/10, 2011 at 0:26 Comment(1)
And then just manually turn Element type nodes into textual representation... Probably simpler than what I had in mind.Trinitrophenol
N
0

Not particularly pretty but works fine (using JDOM API):

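import java.util.List;

import org.jdom.Attribute;
import org.jdom.Element;
import org.jdom.Text;

// Rebuilds the element's mixed content as a string; attribute values are not escaped.
public static String getRawText(Element element) {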
    if (element.getContent().size() == 0) {
        return "";
    }

    StringBuffer text = new StringBuffer();
    for (int i = 0; i < element.getContent().size(); i++) {
        final Object obj = element.getContent().get(i);
        if (obj instanceof Text) {
            text.append( ((Text) obj).getText() );
        } else if (obj instanceof Element) {
            Element e = (Element) obj;
            text.append( "<" ).append( e.getName() );
            // dump all attributes
            for (Attribute attribute : (List<Attribute>)e.getAttributes()) {
                text.append(" ").append(attribute.getName()).append("=\"").append(attribute.getValue()).append("\"");
            }
            text.append(">");
            text.append( getRawText( e )).append("</").append(e.getName()).append(">");
        }
    }
    return text.toString();
}

Prashant Bhate's solution is nicer though!

Niello answered 17/1, 2012 at 11:10 Comment(0)
M
0

If you want to output the content of some JDOM node, just use

System.out.println(new XMLOutputter().outputString(node));
Maxima answered 15/9, 2016 at 9:51 Comment(0)
