How to remove extra empty lines from XML file?
B

11

23

In short: I have many empty lines generated in an XML file, and I am looking for a way to remove them in order to clean up the file. How can I do that?

In detail: I currently have this XML file:

<recent>
  <paths>
    <path>path1</path>
    <path>path2</path>
    <path>path3</path>
    <path>path4</path>
  </paths>
</recent>

And I use this Java code to delete all <path> tags and add new ones instead:

public void savePaths( String recentFilePath ) {
    ArrayList<String> newPaths = getNewRecentPaths();
    Document recentDomObject = getXMLFile( recentFilePath );  // Load the XML file into a DOM document.
    NodeList pathNodes = recentDomObject.getElementsByTagName( "path" );   // Get all <path> nodes.

    //1. Remove all old path nodes :
        for ( int i = pathNodes.getLength() - 1; i >= 0; i-- ) { 
            Element pathNode = (Element)pathNodes.item( i );
            pathNode.getParentNode().removeChild( pathNode );
        }

    //2. Save all new paths :
        Element pathsElement = (Element)recentDomObject.getElementsByTagName( "paths" ).item( 0 );   // Get the first <paths> node.

        for( String newPath: newPaths ) {
            Element newPathElement = recentDomObject.createElement( "path" );
            newPathElement.setTextContent( newPath );
            pathsElement.appendChild( newPathElement );
        }

    //3. Save the XML changes :
        saveXMLFile( recentFilePath, recentDomObject ); 
}

After executing this method a number of times, I get an XML file with the right results, but with many empty lines after the opening <paths> tag and before the first <path> tag, like this:

<recent>
  <paths>





    <path>path5</path>
    <path>path6</path>
    <path>path7</path>
  </paths>
</recent>

Does anyone know how to fix this?

Edit: added the getXMLFile(...) and saveXMLFile(...) code.

public Document getXMLFile( String filePath ) { 
    File xmlFile = new File( filePath );

    try {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document domObject = db.parse( xmlFile );
        domObject.getDocumentElement().normalize();

        return domObject;
    } catch (Exception e) {
        e.printStackTrace();
    }

    return null;
}

public void saveXMLFile( String filePath, Document domObject ) {
    File xmlOutputFile = null;
    FileOutputStream fos = null;

    try {
        xmlOutputFile = new File( filePath );
        fos = new FileOutputStream( xmlOutputFile );
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        transformer.setOutputProperty( OutputKeys.INDENT, "yes" );
        transformer.setOutputProperty( "{http://xml.apache.org/xslt}indent-amount", "2" );
        DOMSource xmlSource = new DOMSource( domObject );
        StreamResult xmlResult = new StreamResult( fos );
        transformer.transform( xmlSource, xmlResult );  // Save the XML file.
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (TransformerConfigurationException e) {
        e.printStackTrace();
    } catch (TransformerException e) {
        e.printStackTrace();
    } finally {
        if (fos != null)
            try {
                fos.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
    }
}
Burglar answered 1/10, 2012 at 8:19 Comment(2)
It might be helpful to see the contents of your saveXMLFile method.Intonation
You could have a look at Deleting Nodes and Empty Lines in XML Using Java and #7191139Agger
B
5

I was able to fix this by using this code after removing all the old "path" nodes :

while( pathsElement.hasChildNodes() )
    pathsElement.removeChild( pathsElement.getFirstChild() );

This removes every remaining child of <paths>, including the leftover whitespace-only text nodes that show up as empty lines in the XML file.
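
For cases where some children of the element must survive, a more selective variant is possible. The following is a minimal sketch, not the code from the question: it removes only direct children that are whitespace-only text nodes, checking the node type and value first. The names BlankTextRemover, removeBlankTextChildren and parse are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class BlankTextRemover {

    // Removes only the direct children of parent that are whitespace-only
    // text nodes and returns how many nodes were removed.
    public static int removeBlankTextChildren(Element parent) {
        int removed = 0;
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember the sibling before detaching
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                parent.removeChild(child);
                removed++;
            }
            child = next;
        }
        return removed;
    }

    // Hypothetical helper for the demo: parses an XML string into a DOM document.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<paths>\n\n\n  <path>path5</path>\n  <path>path6</path>\n</paths>");
        Element paths = doc.getDocumentElement();
        int removed = removeBlankTextChildren(paths);
        System.out.println(removed + " blank text nodes removed, "
                + paths.getChildNodes().getLength() + " children left");
    }
}
```

Called on pathsElement before saving, this keeps any remaining elements and only drops the stray indentation nodes.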

Special thanks to MadProgrammer for commenting with the helpful link mentioned above.

Burglar answered 1/10, 2012 at 13:12 Comment(3)
I wouldn't be a huge fan of blindly removing child nodes without knowing what they are. At the least, I'd include a test here to see that I really am removing an empty text node (using 'getNodeType' and 'getNodeValue').Contango
@Contango .. I agree with you, but in my case i am sure they are all empty, because i have already deleted them myself. On the opposite, if there is something missing and not deleted, then i want to remove it :)Burglar
@Brad, please check my answer: goo.gl/06Qd9 , I explained how to remove these blank lines without blind removing all the child nodes, and wrote something about the cause of such behavior.Mercorr
C
31

First, an explanation of why this happens, which might be a bit off since you didn't include the code that is used to load the XML file into a DOM object.

When you read an XML document from a file, the whitespace between tags actually constitutes valid DOM nodes, according to the DOM specification. Therefore, the XML parser treats each such run of whitespace as a DOM node (of type TEXT). When you remove the <path> elements, the whitespace text nodes around them stay behind, and they are written back to the file on every save.

To get rid of it, there are three approaches I can think of:

  • Associate the XML with a schema, and then use setValidating(true) along with setIgnoringElementContentWhitespace(true) on the DocumentBuilderFactory.

    (Note: setIgnoringElementContentWhitespace will only work if the parser is in validating mode, which is why you must use setValidating(true))

  • Write an XSL to process all nodes, filtering out whitespace-only TEXT nodes.
  • Use Java code to do this: use XPath to find all whitespace-only TEXT nodes, iterate through them and remove each one from its parent (using getParentNode().removeChild()). Something like this would do (doc would be your DOM document object):

    XPath xp = XPathFactory.newInstance().newXPath();
    NodeList nl = (NodeList) xp.evaluate("//text()[normalize-space(.)='']", doc, XPathConstants.NODESET);
    
    for (int i=0; i < nl.getLength(); ++i) {
        Node node = nl.item(i);
        node.getParentNode().removeChild(node);
    }
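
For the XSL approach from the list above, a minimal sketch could look like the following. It assumes an identity stylesheet with xsl:strip-space, which drops whitespace-only text nodes while copying everything else. The class name StripSpaceDemo, the method stripWhitespace and the inline stylesheet are illustrative assumptions, not part of the question's code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StripSpaceDemo {

    // Identity transform plus xsl:strip-space for all elements.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:strip-space elements=\"*\"/>"
      + "<xsl:template match=\"@*|node()\">"
      + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Returns the XML with all whitespace-only text nodes removed.
    public static String stripWhitespace(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<recent>\n  <paths>\n\n\n\n    <path>path5</path>\n"
                   + "    <path>path6</path>\n  </paths>\n</recent>";
        System.out.println(stripWhitespace(xml));
    }
}
```

The stripped result can then be re-indented in a second pass if pretty-printed output is wanted.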
    
Contango answered 1/10, 2012 at 8:57 Comment(3)
I do not know how to do that :), but i have added the getXMLFile(...) code to the question.Burglar
Another possibility would be to define an XML Schema to validate the document against, and then use DocumentBuilderFactory's "setIgnoringElementContentWhitespace" in conjunction with "setValidating". Many ways to skin this cat.Contango
How could i remove the new line in the <p> tag eg: <p id="P2">Cytochrome P450 reductase (NADPH-cytochrome P450 oxidoreductase; EC 1.6.2.4; abbreviated as either POR or CPR) is the key electron donor to the cytochrome P450 (P450) superfamily of xenobiotic metabolizing enzymes. It also plays a number of important roles in endogenous metabolism, passing electrons to a range of acceptors including cytochrome b5 (supporting fatty acid desaturase and elongase activities), squalene monooxygenase (sterol biosyn</p>Ampereturn
D
2

You could look at something like this if you only need to "clean" your xml quickly. Then you could have a method like:

public static String cleanUp(String xml) {
    final StringReader reader = new StringReader(xml.trim());
    final StringWriter writer = new StringWriter();
    try {
        XmlUtil.prettyFormat(reader, writer);
        return writer.toString();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return xml.trim();
}

Also, to compare and check differences, if you need it: XMLUnit

Denadenae answered 1/10, 2012 at 8:45 Comment(2)
Which library does XmlUtil belong to? Please always mention the library...Pillowcase
It is however a violation of the aim of XMLUnit. This library is explicit implemented to test the code producing an XML output a better way. In this sense, it should not be used in productive code...Brunhild
M
2

I faced the same problem and had no idea what caused it for a long time, but after reading Brad's question and his own answer to it, I figured out where the trouble is.

I have to add my own answer, because Brad's isn't really perfect, as Isaac said:

I wouldn't be a huge fan of blindly removing child nodes without knowing what they are

So a better "solution" (quoted because it is really a workaround) is:

pathsElement.setTextContent("");

This completely removes the useless blank lines, and it is simpler than looping over the child nodes. Note that it has the same effect: setTextContent("") discards every child of the element, so use it only when nothing under <paths> needs to be kept. Brad, this should work for you too.

But this treats the effect, not the cause.

The cause is: when we call removeChild(), it removes the child, but it leaves behind the removed child's indentation and line break. This leftover indentation and line break is treated as text content.

So, to remove the cause, we should figure out how to remove a child together with its indentation. Welcome to my question about this.
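
One way to address the cause directly is to remove the whitespace-only text node that precedes a child along with the child itself. The following is a minimal sketch under that assumption. The names IndentAwareRemover, removeWithIndent and parse are hypothetical, and the sketch only handles indentation stored in the immediately preceding sibling.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class IndentAwareRemover {

    // Removes child from its parent and, if the node directly before it is a
    // whitespace-only text node (its indentation and line break), removes that too.
    public static void removeWithIndent(Node child) {
        Node parent = child.getParentNode();
        Node previous = child.getPreviousSibling();
        if (previous != null
                && previous.getNodeType() == Node.TEXT_NODE
                && previous.getNodeValue().trim().isEmpty()) {
            parent.removeChild(previous);
        }
        parent.removeChild(child);
    }

    // Hypothetical helper for the demo: parses an XML string into a DOM document.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<paths>\n  <path>a</path>\n  <path>b</path>\n</paths>");
        NodeList pathNodes = doc.getElementsByTagName("path");
        // Iterate backwards because the NodeList is live and shrinks on removal.
        for (int i = pathNodes.getLength() - 1; i >= 0; i--) {
            removeWithIndent(pathNodes.item(i));
        }
        System.out.println("children left under <paths>: "
                + doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Only the final line break before the closing tag remains, so repeated remove-and-save cycles no longer pile up blank lines.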

Mercorr answered 10/1, 2013 at 9:57 Comment(1)
Yup, much simpler... assuming you DO want to blindly remove all child nodes without knowing what they are. :-)Kirstiekirstin
S
1

There is a very simple way to get rid of the empty lines if you are using a DOM-handling API (for example dom4j):

  • place the text you want to keep in a variable (e.g. text)
  • set the node text to "" using node.setText("")
  • set the node text back to text using node.setText(text)

Et voilà! There are no more empty lines. The other answers explain very well how the extra empty lines in the XML output are actually extra nodes of type text.

This technique can be used with any DOM parsing system, so long as the name of the text setting function is changed to suit the one in your API, hence the way of representing it slightly more abstractly.

Hope this helps:)

Shackelford answered 9/5, 2014 at 10:0 Comment(0)
A
1

When I used dom4j to remove some elements, I ran into the same problem. The solutions above did not work for me without adding other jars. In the end I found a simple solution that only needs the JDK io package:

  1. Use BufferedReader to read the XML file and filter out empty lines.
StringBuilder stringBuilder = new StringBuilder();
FileInputStream fis = new FileInputStream(outFile);
InputStreamReader isr = new InputStreamReader(fis);
BufferedReader br = new BufferedReader(isr);
String s;
while ((s = br.readLine()) != null) {
  if (s.trim().length() > 0) {
    stringBuilder.append(s).append("\n");
  }
}
  2. Write the string back to the XML file.
OutputStreamWriter osw = new OutputStreamWriter(fou);
BufferedWriter bw = new BufferedWriter(osw);
String str = stringBuilder.toString();
bw.write(str);
bw.flush();
  3. Remember to close all the streams.
Ardent answered 4/6, 2020 at 1:31 Comment(1)
Well, i tried this today, and it works good 🙂Burglar
F
1

In my case, I converted it to a string then just did a regex:

        //save as String
        StringWriter writer = new StringWriter();
        StreamResult result = new StreamResult(writer);
        tr.transform(new DOMSource(document), result);
        strResult = writer.toString();

        //remove empty lines 
        strResult = strResult.replaceAll("\\n\\s*\\n", "\n");
Foumart answered 6/5, 2021 at 14:8 Comment(1)
Yes, ideal when you need a string.Poriferous
P
0

A couple of remarks:

  1. When you are manipulating XML (removing elements, adding new ones), I strongly advise you to use XSLT rather than DOM.
  2. When you transform an XML document with XSLT (as you do in your save method), set OutputKeys.INDENT to "no".
  3. For simple post-processing of your XML (removing whitespace, comments, etc.) you can use a simple SAX2 filter.

Procure answered 1/10, 2012 at 8:41 Comment(0)
L
0
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setIgnoringElementContentWhitespace(true);
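
According to the DocumentBuilderFactory documentation, this flag relies on the content model and therefore requires the parser to be in validating mode. A minimal sketch of that combination follows, assuming an inline DTD that describes the file from the question. The DTD, the class name IgnorableWhitespaceDemo and the method parseValidating are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class IgnorableWhitespaceDemo {

    // Sample input with an inline DTD (assumed here) so the parser knows
    // that <recent> and <paths> contain only elements, never text.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE recent [\n"
      + "  <!ELEMENT recent (paths)>\n"
      + "  <!ELEMENT paths (path*)>\n"
      + "  <!ELEMENT path (#PCDATA)>\n"
      + "]>\n"
      + "<recent>\n  <paths>\n\n\n    <path>path1</path>\n"
      + "    <path>path2</path>\n  </paths>\n</recent>";

    // Parses with validation enabled so element-content whitespace is dropped.
    public static Document parseValidating(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(true);                       // DTD validation
            dbf.setIgnoringElementContentWhitespace(true); // needs validating mode
            return dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseValidating(XML);
        Node paths = doc.getElementsByTagName("paths").item(0);
        System.out.println("children of <paths>: " + paths.getChildNodes().getLength());
    }
}
```

This only affects what the parser loads. Whitespace that is left behind by later removeChild() calls, or added by the transformer when saving, is not touched by it.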
Lorenzoloresz answered 5/11, 2013 at 18:15 Comment(1)
This will not ignore white space in the newly generated XML.Tested this.Zobkiw
H
0

I am using the code below:

System.out.println("Start remove textnode");
        i=0;
        while (parentNode.getChildNodes().item(i)!=null) {
            System.out.println(parentNode.getChildNodes().item(i).getNodeName());
            if (parentNode.getChildNodes().item(i).getNodeName().equalsIgnoreCase("#text")) {
                parentNode.removeChild(parentNode.getChildNodes().item(i));
                System.out.println("text node removed");
            } else {
                // Advance only when nothing was removed: the child list is live,
                // so after a removal the next node has already moved to index i.
                i=i+1;
            }
        }
Hydrocellulose answered 11/7, 2014 at 6:48 Comment(0)
S
0

Very late answer, but maybe it is still helpful to someone.

I had this code in my class, where the document is built after transformation (Just like you):

TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");

Change the last line to

transformer.setOutputProperty(OutputKeys.INDENT, "no");
Sixpence answered 3/1, 2022 at 16:19 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Claus
