I have a 1 GB XML file. How can I split it into smaller, well-formed XML files using Java?
Here is an example:
<records>
    <record id="001">
        <name>john</name>
    </record>
    ....
</records>
Thanks.
I would use a StAX parser for this situation. It will prevent the entire document from being read into memory at one time.
Code Example
For the following XML, write each "statement" element to a file named after the value of its "account" attribute:
<statements>
    <statement account="123">
        ...stuff...
    </statement>
    <statement account="456">
        ...stuff...
    </statement>
</statements>
This can be done with the following code:
import java.io.File;
import java.io.FileReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

public class Demo {

    public static void main(String[] args) throws Exception {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
        xsr.nextTag(); // Advance to statements element

        // A Transformer created without a stylesheet performs an identity copy
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        while (xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            // The "out" directory must already exist
            File file = new File("out/" + xsr.getAttributeValue(null, "account") + ".xml");
            // Copies the current statement element (and everything inside it) to the file
            t.transform(new StAXSource(xsr), new StreamResult(file));
        }
    }

}
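The comments on this answer point out that the loop above can misbehave when there is no whitespace between </statement> and the next <statement>, because the transformation itself advances the reader. The following is a minimal sketch of a loop that tolerates both layouts. It assumes the behaviour described in those comments (the JDK's built-in transformer leaves the reader on the event that follows the copied element's end tag); other transformer implementations may position the reader differently. The class name StatementSplitter and the split method are hypothetical names used only for illustration.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

public class StatementSplitter {

    // Writes each child element of the root to outDir/<account>.xml
    // and returns the number of files written.
    public static int split(Path input, Path outDir) throws Exception {
        Files.createDirectories(outDir);
        Transformer t = TransformerFactory.newInstance().newTransformer();
        int count = 0;
        try (InputStream in = Files.newInputStream(input)) {
            XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(in);
            xsr.nextTag();          // root element (<statements>)
            int event = xsr.next(); // first event inside the root
            while (event != XMLStreamConstants.END_ELEMENT
                    && event != XMLStreamConstants.END_DOCUMENT) {
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String account = xsr.getAttributeValue(null, "account");
                    try (OutputStream out = Files.newOutputStream(outDir.resolve(account + ".xml"))) {
                        t.transform(new StAXSource(xsr), new StreamResult(out));
                    }
                    count++;
                    // ASSUMPTION: the transformer has already moved the reader past the
                    // element's end tag, so inspect the current event instead of advancing.
                    event = xsr.getEventType();
                } else {
                    event = xsr.next(); // whitespace, comments, etc. between elements
                }
            }
            xsr.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(split(Paths.get("input.xml"), Paths.get("out")) + " files written");
    }
}
```

Reading the current event type after each transform, rather than calling nextTag() unconditionally, is what keeps the loop from stepping one level too deep when elements are adjacent.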
nextTag, by definition, does not work if there is no whitespace or line break between a closing and an opening <statement> tag, e.g. </statement><statement>. Could you recommend how to proceed if my XML has tags with no whitespace in between? – Lionhearted
The nextTag() method will skip over any whitespace that is there; the whitespace does not need to be there. – Emelia
Change the while loop to while (xsr.isStartElement() || xsr.nextTag() == XMLStreamConstants.START_ELEMENT) and add an extra xsr.nextTag() just before the while loop. Perhaps that will work for you as well? The problem is that the sub-fragment transformation also advances to the next element, so the nextTag() moves one level too deep. – Emelia
Try this, using Saxon-EE 9.3.
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:mode streamable="yes"/>
    <xsl:template match="record">
        <xsl:result-document href="record-{@id}.xml">
            <xsl:copy-of select="."/>
        </xsl:result-document>
    </xsl:template>
</xsl:stylesheet>
The software isn't free, but if it saves you a day's coding you can easily justify the investment. (Apologies for the sales pitch).
DOM, StAX, and SAX will all do the job, but each has its own pros and cons.
Hope this helps
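As one illustration of the StAX option for the <records> example in the question, here is a minimal sketch that streams the input and writes a fixed number of <record> elements per output file, each wrapped in a copy of the original root element so that every part is well-formed. The class name RecordSplitter, the split method, the perFile parameter, and the part-N.xml file names are all hypothetical choices made for this sketch. It assumes the elements to split on are the direct children of the root.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class RecordSplitter {

    // Splits the root's child elements into files holding at most perFile
    // children each; returns the number of files written.
    public static int split(Path input, Path outDir, int perFile) throws Exception {
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        Files.createDirectories(outDir);

        int files = 0, inChunk = 0, depth = 0;
        StartElement root = null;
        OutputStream out = null;
        XMLEventWriter writer = null;

        try (InputStream in = Files.newInputStream(input)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (e.isStartElement()) {
                    depth++;
                    if (depth == 1) {
                        root = e.asStartElement(); // remembered, re-emitted in every part
                    } else if (depth == 2 && writer == null) {
                        // First record of a new part: open the file and write the root start tag
                        out = Files.newOutputStream(outDir.resolve("part-" + files++ + ".xml"));
                        writer = outFactory.createXMLEventWriter(out, "UTF-8");
                        writer.add(events.createStartDocument("UTF-8", "1.0"));
                        writer.add(root);
                    }
                }
                if (depth >= 2) {
                    writer.add(e); // copy everything that belongs to a record
                }
                if (e.isEndElement()) {
                    depth--;
                    if (depth == 1 && ++inChunk == perFile) {
                        finish(writer, out, events, root);
                        writer = null;
                        inChunk = 0;
                    }
                }
            }
        }
        if (writer != null) {
            finish(writer, out, events, root); // last, partially filled part
        }
        return files;
    }

    // Closes the root element and the document, then releases the file.
    private static void finish(XMLEventWriter writer, OutputStream out,
                               XMLEventFactory events, StartElement root) throws Exception {
        QName n = root.getName();
        writer.add(events.createEndElement(n.getPrefix(), n.getNamespaceURI(), n.getLocalPart()));
        writer.add(events.createEndDocument());
        writer.close();
        out.close();
    }

    public static void main(String[] args) throws Exception {
        int n = split(Paths.get(args[0]), Paths.get(args[1]), Integer.parseInt(args[2]));
        System.out.println(n + " files written");
    }
}
```

Only one event is held in memory at a time, so the size of the input does not matter. Whitespace and comments that sit directly under the root, between records, are dropped in this sketch.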
I respectfully disagree with Blaise Doughan. SAX is not only hard to use but also very slow. With VTD-XML, you can use XPath to simplify the processing logic (a 10x code reduction is very common), and it is also much faster because there is no redundant encoding/decoding conversion. Below is the Java code using VTD-XML:
import java.io.FileOutputStream;
import com.ximpleware.*;

public class split {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        // parseFile reads a local file; the second argument turns on namespace awareness
        if (vg.parseFile("c:\\xml\\input.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/records/record");
            byte[] xml = vn.getXML().getBytes();
            int j = 0;
            while (ap.evalXPath() != -1) {
                // Lower 32 bits: byte offset of the fragment; upper 32 bits: its length
                long l = vn.getElementFragment();
                try (FileOutputStream out = new FileOutputStream("out" + j + ".xml")) {
                    out.write(xml, (int) l, (int) (l >> 32));
                }
                j++;
            }
        }
    }
}
See the VTDGen.parseFile() method: fis = new FileInputStream(f); byte[] b = new byte[(int) f.length()];. So you load the whole file into memory. This is really disgusting. – Grapeshot