Split 1GB Xml file using Java
H

4

13

I have a 1 GB XML file. How can I split it into smaller, well-formed XML files using Java?

Here is an example:

<records>
  <record id="001">
    <name>john</name>
  </record>
 ....
</records>

Thanks.

Hendrix answered 2/3, 2011 at 15:53 Comment(5)
That depends on what kind of XML you're handling.Actium
Maybe you could post a small example describing your file and how you want it to be split up. As larsmans mentioned, that depends pretty much on how it's structured and what the small chunks should look like.Machinate
like this <records><record id="001"><name>john</name></record>....</records>Hendrix
Either SAX (the obvious choice), or some multi-GB 64-bit Java and enjoy your DOM.Reive
with vtd-xml, total lines of code will probably be below 15.Houseclean
D
19

I would use a StAX parser for this situation. It will prevent the entire document from being read into memory at one time.

  1. Advance the XMLStreamReader to the local root element of the sub-fragment.
  2. You can then use the javax.xml.transform APIs to produce a new document from this XML fragment. This will advance the XMLStreamReader to the end of that fragment.
  3. Repeat step 1 for the next fragment.

Code Example

For the following XML, output each "statement" section into a file named after the value of its "account" attribute:

<statements>
   <statement account="123">
      ...stuff...
   </statement>
   <statement account="456">
      ...stuff...
   </statement>
</statements>

This can be done with the following code:

import java.io.File;
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

public class Demo {

    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        // Pass a byte stream so the parser honors the encoding in the XML declaration
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileInputStream("input.xml"));
        xsr.nextTag(); // Advance to statements element

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        // Each iteration advances to the next statement element and copies it to its own file
        while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            // The "out" directory must already exist
            File file = new File("out/" + xsr.getAttributeValue(null, "account") + ".xml");
            t.transform(new StAXSource(xsr), new StreamResult(file));
        }
    }

}
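One caveat, offered as a hedged sketch rather than a definitive fix: on some JDKs the transformer appears to leave the reader positioned just past the fragment's end tag. When two fragments are adjacent with no whitespace between them (`</statement><statement>`), a following `nextTag()` call can then step into the children of the next fragment. The variant below checks the current event before advancing, so it should behave the same either way. The class name `StatementSplitter` and the method `split` are hypothetical, and it assumes every direct child of the root element carries an `account` attribute.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

// Hypothetical helper, not part of any library.
public class StatementSplitter {

    /** Writes each child of the root element to outDir/<account>.xml and returns the file count. */
    public static int split(File input, File outDir) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        int count = 0;
        try (InputStream in = new FileInputStream(input)) {
            XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(in);
            xsr.nextTag(); // root element
            xsr.next();    // first event inside the root
            while (xsr.hasNext()) {
                if (xsr.isStartElement()) {
                    // Already on a fragment start tag: copy the fragment without advancing first
                    File file = new File(outDir, xsr.getAttributeValue(null, "account") + ".xml");
                    t.transform(new StAXSource(xsr), new StreamResult(file));
                    count++;
                } else {
                    // Whitespace, comments, or an end tag: move on one event
                    xsr.next();
                }
            }
            xsr.close();
        }
        return count;
    }
}
```

Calling `StatementSplitter.split(new File("input.xml"), new File("out"))` would mirror the example above; the output directory has to exist beforehand.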
Ding answered 2/3, 2011 at 16:27 Comment(8)
Why involve javax.xml.transform when we can pipe directly from XMLStreamReader to XMLStreamWriter, rolling to a new file between every nth record element?Strapless
Yea this is the hot tip, just "pipe" them together and occasionally "close" and reopen the XMLStreamWriter every N records.Doublestop
Can't transform a Source of type javax.xml.transform.stax.StAXSource ??Nard
@Nard - What version of the JDK are you using? I just reran the code as is and it worked perfectly fine. I am using Oracle JDK 1.7.0 for the Mac.Ding
Beta033, you might need this: System.setProperty("javax.xml.transform.TransformerFactory", "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl");Footcandle
@BlaiseDoughan - nextTag, by definition, does not work if there is no whitespace or line break between a closing and a starting <statement> tag, e.g. </statement><statement>. Could you recommend how to proceed if my XML has tags with no whitespace in between?Lionhearted
@Lionhearted What do you mean "by definition it does not work if there is no whitespace between a closing and starting tag"? The javadoc just states that the nextTag() method will skip over any whitespace there is, not that it needs to be there.Emelia
@Lionhearted I did have to change the while loop to while (xsr.isStartElement() || xsr.nextTag() == XMLStreamConstants.START_ELEMENT) and add an extra xsr.nextTag() just before the while loop. Perhaps that will work for you as well? The problem is that the sub-fragment transformation also advances to the next element so that the nextTag() moves one level too deep.Emelia
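The "pipe and roll every N records" idea from the comments above could look roughly like the sketch below. It uses the StAX event API (XMLEventReader/XMLEventWriter) instead of the cursor API named in the comment, because events can be copied with a single `add` call. The class name `RollingSplitter`, the `part-N.xml` file names, and the `perFile` parameter are all hypothetical; the sketch assumes `perFile` is at least 1 and that the records are the direct children of the root element. Each output file repeats the original root element so that it stays well-formed.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of any library.
public class RollingSplitter {

    /** Copies the root's children into part-0.xml, part-1.xml, ... with at most perFile children each. */
    public static int split(File input, File outDir, int perFile) throws Exception {
        XMLEventFactory ef = XMLEventFactory.newInstance();
        XMLOutputFactory xof = XMLOutputFactory.newInstance();
        int files = 0;
        OutputStream out = null;
        try (InputStream in = new FileInputStream(input)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            XMLEventWriter writer = null;
            StartElement root = null;
            int depth = 0;   // 1 = inside root, 2 = inside a record
            int inFile = 0;  // records written to the current file
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (e.isStartElement()) {
                    depth++;
                    if (depth == 1) {
                        root = e.asStartElement(); // remembered and replayed in every output file
                        continue;
                    }
                    if (depth == 2 && writer == null) {
                        // First record of a new chunk: open the next file and write the root start tag
                        out = new FileOutputStream(new File(outDir, "part-" + files++ + ".xml"));
                        writer = xof.createXMLEventWriter(out, "UTF-8");
                        writer.add(ef.createStartDocument("UTF-8", "1.0"));
                        writer.add(root);
                    }
                }
                if (writer != null && depth >= 2) {
                    writer.add(e); // copy everything that belongs to a record
                }
                if (e.isEndElement()) {
                    depth--;
                    boolean chunkFull = depth == 1 && ++inFile == perFile;
                    boolean inputDone = depth == 0 && writer != null;
                    if (chunkFull || inputDone) {
                        // Close the root element and the file, then start fresh on the next record
                        writer.add(ef.createEndElement(root.getName(), root.getNamespaces()));
                        writer.add(ef.createEndDocument());
                        writer.close();
                        out.close();
                        writer = null;
                        out = null;
                        inFile = 0;
                    }
                }
            }
        } finally {
            if (out != null) out.close(); // only reached with an open file if something failed
        }
        return files;
    }
}
```

Whitespace and comments between records are dropped in this sketch; only the record elements themselves are copied.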
D
4

Try this, using Saxon-EE 9.3.

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:mode streamable="yes"/>
    <xsl:template match="record">
      <xsl:result-document href="record-{@id}.xml">
        <xsl:copy-of select="."/>
      </xsl:result-document>
    </xsl:template>
</xsl:stylesheet>

The software isn't free, but if it saves you a day's coding you can easily justify the investment. (Apologies for the sales pitch).

Damnify answered 2/3, 2011 at 23:36 Comment(0)
W
3

DOM, StAX, and SAX will all do, but each has its own pros and cons.

  1. With DOM, the whole document is loaded into memory, which may not be feasible for a file this size.
  2. Programming control is easiest with DOM, then StAX, then SAX.
  3. A combination of SAX and DOM is a better option.
  4. Using a framework that already does this can be the best option. Have a look at Smooks: http://www.smooks.org
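As a rough illustration of point 3, here is a minimal sketch that streams through the file and builds a small DOM tree for one record at a time, so only a single record is ever held in memory. Note that it swaps in StAX for the streaming half instead of SAX, purely because the JDK can copy a StAX fragment into a DOMResult in a few lines. The class name `RecordDomReader` and the method `forEachRecord` are hypothetical, and the sketch assumes the records are the direct children of the root element.

```java
import java.io.InputStream;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of any library.
public class RecordDomReader {

    /** Streams the input and hands each child of the root to the handler as its own small DOM tree. */
    public static int forEachRecord(InputStream in, Consumer<Element> handler) throws Exception {
        XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(in);
        Transformer t = TransformerFactory.newInstance().newTransformer();
        int count = 0;
        xsr.nextTag(); // root element
        xsr.next();    // first event inside the root
        while (xsr.hasNext()) {
            if (xsr.isStartElement()) {
                // Copy just this record into a fresh DOM document
                DOMResult result = new DOMResult();
                t.transform(new StAXSource(xsr), result);
                handler.accept(((Document) result.getNode()).getDocumentElement());
                count++;
            } else {
                xsr.next(); // skip whitespace, comments, and the closing root tag
            }
        }
        xsr.close();
        return count;
    }
}
```

The handler can then use ordinary DOM calls such as `getAttribute("id")` on each record before it is discarded.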

Hope this helps

Wendiewendin answered 7/3, 2011 at 2:5 Comment(0)
H
0

I respectfully disagree with Blaise Doughan. SAX is not only hard to use but also very slow. With VTD-XML, you can use XPath to simplify the processing logic (a 10x code reduction is very common), and it is also much faster because there is no redundant encoding/decoding conversion. Below is the Java code with VTD-XML:

import java.io.FileOutputStream;
import com.ximpleware.*;

public class split {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        // parseFile reads a local file; the second argument turns on namespace awareness
        if (vg.parseFile("c:\\xml\\input.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/records/record");
            int j = 0;
            while (ap.evalXPath() != -1) {
                // Lower 32 bits hold the fragment's byte offset, upper 32 bits its length
                long l = vn.getElementFragment();
                try (FileOutputStream out = new FileOutputStream("out" + j + ".xml")) {
                    out.write(vn.getXML().getBytes(), (int) l, (int) (l >> 32));
                }
                j++;
            }
        }
    }
}
Houseclean answered 2/3, 2011 at 20:34 Comment(6)
My suggestion was to use StAX, not SAX. Also, according to the VTD-XML FAQ (vtd-xml.sourceforge.net/faq.html), the 1GB file size mentioned in the question is the upper bound of VTD-XML's range for handling namespace-qualified XML.Ding
There's no significant performance difference between StAX and SAX. Both are as fast as you will get. Some people might find StAX easier to use, however - using an event-based programming model like SAX requires more programming maturity.Damnify
Without namespace support, vtd-xml supports files up to 2GB in size. Extended VTD-XML has a file size limit of 256 GB, even with namespace support.Houseclean
This is a piece of your code (the VTDGen.parseFile() method): fis = new FileInputStream(f); byte[] b = new byte[(int) f.length()]; So you load the whole file into memory. This is really disgusting.Grapeshot
@Andremoniy--loading everything in memory is not the issue, as long as it doesn't blow up like DOM, which causes an out of memory exception... nowadays, a 64-bit machine with 4GB of memory is so common, am I not right?Houseclean
@Houseclean OP doesn't mention the type of environment this needs to run in. But if it is a multi-user environment and each user might be running this code, then a 4 GB machine will let at most 4 users split up a 1 GB file like this. That might not be enough.Emelia
