Convert XML file to CSV in Java

Here is an example XML (case 1) :

<root>
  <Item>
    <ItemID>4504216603</ItemID>
    <ListingDetails>
      <StartTime>10:00:10.000Z</StartTime>
      <EndTime>10:00:30.000Z</EndTime>
      <ViewItemURL>http://url</ViewItemURL>
      ....
  </Item>
 

Here is an example XML (case 2) :

  <Item>
    <ItemID>4504216604</ItemID>
    <ListingDetails>
      <StartTime>10:30:10.000Z</StartTime>
      <!-- Start difference from case 1 -->
      <averages>
        <AverageTime>value1</AverageTime>
        <category type="TX">9823</category>
        <category type="TY">9112</category>
        <AveragePrice>value2</AveragePrice>
      </averages>
      <!-- End difference from case 1 -->
      <EndTime>11:00:10.000Z</EndTime>
      <ViewItemURL>http://url</ViewItemURL>
      ....
  </Item>
</root>

I borrowed this XML from Google. My objects are not always the same; sometimes there are extra elements, as in case 2. Now I'd like to produce a CSV like this from both cases:

ItemID,StartTime,EndTime,ViewItemURL,AverageTime,AveragePrice
4504216603,10:00:10.000Z,10:00:30.000Z,http://url
4504216604,10:30:10.000Z,11:00:10.000Z,http://url,value1,value2

The first line is the header and it should also be included in the CSV. I got some useful links to StAX today, but I really don't know what the right/optimal approach for this is, and I am seeking ideas.

Update 1

I forgot to mention that this is a huge XML file, up to 1 GB.

Update 2

I'm looking for a more generic approach, meaning it should work for any number of nodes at any depth. As in the example XML, one item object can have more nodes than the next/previous one, so that case should be handled as well (so that all columns and values match up in the CSV).

It can also happen that nodes have the same name/localName but different values and attributes; if that is the case, a new column should appear in the CSV with the appropriate value. (I added an example of this case inside the <averages> tag, called category.)

Duologue answered 20/7, 2010 at 19:2 Comment(3)
Are the values value1, ... , valuen always immediate children of the <averages/> element? Is averages the only element that might appear? Or do you need to be more flexible about what appears there?Gummous
@cOmrade: about your "update": if the first element is not the one with the most columns, then you just need two passes/steps for the transformation: in step one you just collect all the columns, and in step two you process them as described. If no value node is found for a particular node, then you can put in whatever value you wish (null or empty or whatever convention you like - see the description in my answer). It's not a problem that the nodes are nested, since for CSV they will be read flat.Mouse
@A. Ionescu thank you for your response. I think I've gathered enough algorithms for solving this, and I even have one implemented (not working properly), but right now I'm looking for some code, hence the bounty.Duologue

The code provided should be considered a sketch rather than the definitive article. I am not an expert on SAX, and the implementation could be improved for better performance, simpler code, etc. That said, SAX should be able to cope with streaming large XML files.

I would approach this problem with 2 passes using the SAX parser. (Incidentally, I would also use a CSV-generating library to create the output, as this would deal with all the fiddly character escaping that CSV involves, but I haven't implemented this in my sketch.)
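
For illustration, here is a minimal sketch of the RFC 4180-style quoting such a CSV library would take care of; it is not wired into the code below, and the method name is just an example.

// Quote a single CSV field: wrap it in double quotes if it contains a comma,
// quote or newline, and double any embedded quotes.
static String csvField(String value) {
    if (value == null) {
        return "";
    }
    if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }
    return value;
}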

First pass: Establish number of header columns

Second pass: Output CSV

I assume that the XML file is well formed. I assume that we don't have a schema/DTD with a predefined order.

In the first pass I have assumed that a CSV column will be added for every XML element containing text content or for any attribute (I have assumed attributes will contain something!).

The second pass, having established the number of target columns, will do the actual CSV output.

Based on your example XML my code sketch would produce:

ItemID,StartTime,EndTime,ViewItemURL,AverageTime,category,category,type,type,AveragePrice
4504216603,10:00:10.000Z,10:00:30.000Z,http://url,,,,,,
4504216604,10:30:10.000Z,11:00:10.000Z,http://url,value1,9823,9112,TX,TY,value2

Please note I have used the Google Collections (now Guava) LinkedHashMultimap, as this is helpful when associating multiple values with a single key. I hope you find this useful!

import com.google.common.collect.LinkedHashMultimap;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map.Entry;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class App {

    public static void main(String[] args) throws SAXException, FileNotFoundException, IOException {
        // First pass - to determine headers
        XMLReader xr = XMLReaderFactory.createXMLReader();
        HeaderHandler handler = new HeaderHandler();
        xr.setContentHandler(handler);
        xr.setErrorHandler(handler);
        FileReader r = new FileReader("test1.xml");
        xr.parse(new InputSource(r));

        LinkedHashMap<String, Integer> headers = handler.getHeaders();
        int totalnumberofcolumns = 0;
        for (int headercount : headers.values()) {
            totalnumberofcolumns += headercount;
        }
        String[] columnheaders = new String[totalnumberofcolumns];
        int i = 0;
        for (Entry<String, Integer> entry : headers.entrySet()) {
            for (int j = 0; j < entry.getValue(); j++) {
                columnheaders[i] = entry.getKey();
                i++;
            }
        }
        StringBuilder sb = new StringBuilder();
        for (String h : columnheaders) {
            sb.append(h);
            sb.append(',');
        }
        System.out.println(sb.substring(0, sb.length() - 1));

        // Second pass - collect and output data

        xr = XMLReaderFactory.createXMLReader();

        DataHandler datahandler = new DataHandler();
        datahandler.setHeaderArray(columnheaders);

        xr.setContentHandler(datahandler);
        xr.setErrorHandler(datahandler);
        r = new FileReader("test1.xml");
        xr.parse(new InputSource(r));
    }

    public static class HeaderHandler extends DefaultHandler {

        private String content;
        private String currentElement;
        private boolean insideElement = false;
        private Attributes attribs;
        private LinkedHashMap<String, Integer> itemHeader;
        private LinkedHashMap<String, Integer> accumulativeHeader = new LinkedHashMap<String, Integer>();

        public HeaderHandler() {
            super();
        }

        private LinkedHashMap<String, Integer> getHeaders() {
            return accumulativeHeader;
        }

        // Count how many times each element/attribute name occurs within the current item;
        // the maximum per name accumulated across all items becomes the number of CSV
        // columns allocated to that name.
        private void addItemHeader(String headerName) {
            if (itemHeader.containsKey(headerName)) {
                itemHeader.put(headerName, itemHeader.get(headerName) + 1);
            } else {
                itemHeader.put(headerName, 1);
            }
        }

        @Override
        public void startElement(String uri, String name,
                String qName, Attributes atts) {
            if ("item".equalsIgnoreCase(qName)) {
                itemHeader = new LinkedHashMap<String, Integer>();
            }
            currentElement = qName;
            content = null;
            insideElement = true;
            attribs = atts;
        }

        @Override
        public void endElement(String uri, String name, String qName) {
            if (!"item".equalsIgnoreCase(qName) && !"root".equalsIgnoreCase(qName)) {
                if (content != null && qName.equals(currentElement) && content.trim().length() > 0) {
                    addItemHeader(qName);
                }
                if (attribs != null) {
                    int attsLength = attribs.getLength();
                    if (attsLength > 0) {
                        for (int i = 0; i < attsLength; i++) {
                            String attName = attribs.getLocalName(i);
                            addItemHeader(attName);
                        }
                    }
                }
            }
            if ("item".equalsIgnoreCase(qName)) {
                for (Entry<String, Integer> entry : itemHeader.entrySet()) {
                    String headerName = entry.getKey();
                    Integer count = entry.getValue();
                    //System.out.println(entry.getKey() + ":" + entry.getValue());
                    if (accumulativeHeader.containsKey(headerName)) {
                        if (count > accumulativeHeader.get(headerName)) {
                            accumulativeHeader.put(headerName, count);
                        }
                    } else {
                        accumulativeHeader.put(headerName, count);
                    }
                }
            }
            insideElement = false;
            currentElement = null;
            attribs = null;
        }

        @Override
        public void characters(char ch[], int start, int length) {
            // characters() may be called more than once for a single text node,
            // so append rather than overwrite.
            if (insideElement) {
                String chunk = new String(ch, start, length);
                content = (content == null) ? chunk : content + chunk;
            }
        }
    }

    public static class DataHandler extends DefaultHandler {

        private String content;
        private String currentElement;
        private boolean insideElement = false;
        private Attributes attribs;
        private LinkedHashMultimap dataMap;
        private String[] headerArray;

        public DataHandler() {
            super();
        }

        @Override
        public void startElement(String uri, String name,
                String qName, Attributes atts) {
            if ("item".equalsIgnoreCase(qName)) {
                dataMap = LinkedHashMultimap.create();
            }
            currentElement = qName;
            content = null;
            insideElement = true;
            attribs = atts;
        }

        @Override
        public void endElement(String uri, String name, String qName) {
            if (!"item".equalsIgnoreCase(qName) && !"root".equalsIgnoreCase(qName)) {
                if (content != null && qName.equals(currentElement) && content.trim().length() > 0) {
                    dataMap.put(qName, content);
                }
                if (attribs != null) {
                    int attsLength = attribs.getLength();
                    if (attsLength > 0) {
                        for (int i = 0; i < attsLength; i++) {
                            String attName = attribs.getLocalName(i);
                            dataMap.put(attName, attribs.getValue(i));
                        }
                    }
                }
            }
            if ("item".equalsIgnoreCase(qName)) {
                String data[] = new String[headerArray.length];
                int i = 0;
                for (String h : headerArray) {
                    if (dataMap.containsKey(h)) {
                        // Take the first remaining value for this header and keep the rest,
                        // so repeated columns (duplicate element names) each get their own value.
                        Object[] values = dataMap.get(h).toArray();
                        data[i] = (String) values[0];
                        if (values.length > 1) {
                            dataMap.removeAll(h);
                            for (int j = 1; j < values.length; j++) {
                                dataMap.put(h, values[j]);
                            }
                        } else {
                            dataMap.removeAll(h);
                        }
                    } else {
                        data[i] = "";
                    }
                    i++;
                }
                StringBuilder sb = new StringBuilder();
                for (String d : data) {
                    sb.append(d);
                    sb.append(',');
                }
                System.out.println(sb.substring(0, sb.length() - 1));
            }
            insideElement = false;
            currentElement = null;
            attribs = null;
        }

        @Override
        public void characters(char ch[], int start, int length) {
            // characters() may be called more than once for a single text node,
            // so append rather than overwrite.
            if (insideElement) {
                String chunk = new String(ch, start, length);
                content = (content == null) ? chunk : content + chunk;
            }
        }

        public void setHeaderArray(String[] headerArray) {
            this.headerArray = headerArray;
        }
    }
}
Trengganu answered 30/7, 2010 at 0:13 Comment(2)
Do you know how to make it work a little more generically without explicitly defining 'item' and 'root'? ie. without lines such as !"item".equalsIgnoreCase(qName) && !"root"Goodnight
Hi @Goodnight you could always do it based on tree depth, see e.g.: #6248822Trengganu

This looks like a good case for using XSL. Given your basic requirements it may be easier to get at the right nodes with XSL as compared to custom parsers or serializers. The benefit would be that your XSL could target "//Item//AverageTime" or whatever nodes you require without worrying about node depth.

UPDATE: The following is the XSLT I threw together to make sure this worked as expected.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />
<xsl:template match="/">
ItemID,StartTime,EndTime,ViewItemURL,AverageTime,AveragePrice
<xsl:for-each select="//Item">
<xsl:value-of select="ItemID"/><xsl:text>,</xsl:text>
<xsl:value-of select=".//StartTime"/><xsl:text>,</xsl:text>
<xsl:value-of select=".//EndTime"/><xsl:text>,</xsl:text>
<xsl:value-of select=".//ViewItemURL"/><xsl:text>,</xsl:text>
<xsl:value-of select=".//AverageTime"/><xsl:text>,</xsl:text>
<xsl:value-of select=".//AveragePrice"/><xsl:text>&#10;</xsl:text>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>
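
For reference, a minimal sketch of driving a stylesheet like this from Java with the standard JAXP transformation API (the file names are assumptions). Note that the default in-memory transformer suits the smaller-file case rather than the 1 GB one.

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltToCsv {
    public static void main(String[] args) throws Exception {
        // Compile the stylesheet, then run it over the XML and write the CSV text output.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("items.xsl")));
        transformer.transform(new StreamSource(new File("items.xml")),
                new StreamResult(new File("items.csv")));
    }
}
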
Satiated answered 30/7, 2010 at 0:42 Comment(3)
Especially the requirement of "any number of nodes with any depth" should force one's thoughts towards XSL and "//Item".Pasteurization
XSL would be the perfect choice if this were a small file; however, the DOM for a 1 GB file could take up a huge amount of memory. So I would imagine some sort of specialized streaming XSL would need to be used (this thread already mentioned Saxonica and VTD-XML). See also: https://mcmap.net/q/909952/-xml-process-large-dataTrengganu
That is some interesting information. In that case, a streaming XSL technology would be useful. Thanks for the link, Mark.Satiated

I'm not sure I understand how generic the solution should be. Do you really want to parse a 1 GB file twice for a generic solution? And if you want something generic, why did you skip the <category> element in your example? How many different formats do you need to handle? Do you really not know what the format can be (even if some elements can be omitted)? Can you clarify?

In my experience, it's generally preferable to parse specific files in a specific way (this doesn't exclude using a generic API, though). My answer will go in this direction (and I'll update it after the clarification).


If you don't feel comfortable with XML, you could consider using some existing (commercial) libraries, for example Ricebridge XML Manager and CSV Manager. See How to convert CSV into XML and XML into CSV using Java for a full example. The approach is pretty straightforward: you define the data fields using XPath expressions (which is perfect in your case since you can have "extra" elements), parse the file, and then pass the resulting List to the CSV component to generate the CSV file. The API looks simple, the code is tested (the source code of their test cases is available under a BSD-style license), and they claim to support gigabyte-sized files.

You can get a Single Developer license for $170, which is not very expensive compared to developer daily rates.

They offer 30 days trial versions, have a look.


Another option would be to use Spring Batch. Spring Batch offers everything required to work with XML files as input or output (using StAX and the XML binding framework of your choice) and flat files as input or output.


You could also use Smooks to do XML-to-CSV transformations.


Another option would be to roll your own solution, using a StAX parser or, why not, VTD-XML and XPath.

Thrum answered 30/7, 2010 at 5:37 Comment(0)

Based on your described requirements, the easiest way to code this is to use FreeMarker's XML processing feature. See the docs.

In this case you will only need the template that will produce a CSV.
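
As an illustration only (not from the original answer), a rough sketch of the Java driver side, assuming a recent FreeMarker version and hypothetical file names; FreeMarker's NodeModel loads the whole document with DOM, so this fits the smaller-file case:

import freemarker.ext.dom.NodeModel;
import freemarker.template.Configuration;
import freemarker.template.Template;
import java.io.File;
import java.io.FileWriter;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;

public class FreeMarkerXmlToCsv {
    public static void main(String[] args) throws Exception {
        Configuration cfg = new Configuration(Configuration.VERSION_2_3_31);
        cfg.setDirectoryForTemplateLoading(new File("."));

        // Expose the parsed XML document to the template as "doc".
        Map<String, Object> model = new HashMap<>();
        model.put("doc", NodeModel.parse(new File("items.xml")));

        Template template = cfg.getTemplate("items-csv.ftl");
        try (Writer out = new FileWriter("items.csv")) {
            template.process(model, out);
        }
    }
}

The template itself (items-csv.ftl here, an assumed name) would loop over the items, roughly along the lines of <#list doc.root.Item as item>...</#list>, emitting one comma-separated line per item.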

An alternative to this is XMLGen, which is very similar in approach. Just look at that diagram and the examples; instead of SQL statements, you will output CSV.

These two similar approaches are not "conventional", but they do the job very quickly in your situation, and you don't have to learn XSL (which is quite hard to master, I think).

Mouse answered 20/7, 2010 at 19:32 Comment(0)

Here is some code that implements the conversion of the XML to CSV using StAX. Although the XML you gave is only an example, I hope that this shows you how to handle the optional elements.

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.*;

public class App 
{
    public static void main( String[] args ) throws XMLStreamException, FileNotFoundException
    {
        new App().convertXMLToCSV(new BufferedInputStream(new FileInputStream(args[0])), new BufferedOutputStream(new FileOutputStream(args[1])));
    }

    static public final String ROOT = "root";
    static public final String ITEM = "Item";
    static public final String ITEM_ID = "ItemID";
    static public final String ITEM_DETAILS = "ListingDetails";
    static public final String START_TIME = "StartTime";
    static public final String END_TIME = "EndTime";
    static public final String ITEM_URL = "ViewItemURL";
    static public final String AVERAGES = "averages";
    static public final String AVERAGE_TIME = "AverageTime";
    static public final String AVERAGE_PRICE = "AveragePrice";
    static public final String SEPARATOR = ",";

    public void convertXMLToCSV(InputStream in, OutputStream out) throws XMLStreamException
    {
        PrintWriter writer = new PrintWriter(out);
        XMLStreamReader xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        convertXMLToCSV(xmlStreamReader, writer);
    }

    public void convertXMLToCSV(XMLStreamReader xmlStreamReader, PrintWriter writer) throws XMLStreamException {
        writer.println("ItemID,StartTime,EndTime,ViewItemURL,AverageTime,AveragePrice");
        xmlStreamReader.nextTag();
        xmlStreamReader.require(XMLStreamConstants.START_ELEMENT, null, ROOT);

        while (xmlStreamReader.hasNext()) {
            xmlStreamReader.nextTag();
            if (xmlStreamReader.isEndElement())
                break;

            xmlStreamReader.require(XMLStreamConstants.START_ELEMENT, null, ITEM);
            String itemID = nextValue(xmlStreamReader, ITEM_ID);
            xmlStreamReader.nextTag(); xmlStreamReader.require(XMLStreamConstants.START_ELEMENT, null, ITEM_DETAILS);
            String startTime = nextValue(xmlStreamReader, START_TIME);
            xmlStreamReader.nextTag();
            String averageTime = null;
            String averagePrice = null;

            // Assumes AverageTime and AveragePrice are the only children of <averages>;
            // the <category> elements added in the question's update would need to be
            // skipped here explicitly.
            if (xmlStreamReader.getLocalName().equals(AVERAGES))
            {
                averageTime = nextValue(xmlStreamReader, AVERAGE_TIME);
                averagePrice = nextValue(xmlStreamReader, AVERAGE_PRICE);
                xmlStreamReader.nextTag();
                xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, AVERAGES);
                xmlStreamReader.nextTag();
            }
            String endTime = currentValue(xmlStreamReader, END_TIME);
            String url = nextValue(xmlStreamReader,ITEM_URL);
            xmlStreamReader.nextTag(); xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, ITEM_DETAILS);
            xmlStreamReader.nextTag(); xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, ITEM);

            writer.append(esc(itemID)).append(SEPARATOR)
                    .append(esc(startTime)).append(SEPARATOR)
                    .append(esc(endTime)).append(SEPARATOR)
                    .append(esc(url));
            if (averageTime!=null)
                writer.append(SEPARATOR).append(esc(averageTime)).append(SEPARATOR)
                        .append(esc(averagePrice));
            writer.println();                        
        }

        xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, ROOT);
        writer.close();

    }

    private String esc(String string) {
        // Minimal CSV escaping: double any embedded quotes, then quote fields that
        // contain a comma, quote or newline.
        if (string.indexOf('"') != -1)
            string = string.replace("\"", "\"\"");
        if (string.indexOf(',') != -1 || string.indexOf('"') != -1 || string.indexOf('\n') != -1)
            string = '"' + string + '"';
        return string;
    }

    private String nextValue(XMLStreamReader xmlStreamReader, String name) throws XMLStreamException {
        xmlStreamReader.nextTag();
        return currentValue(xmlStreamReader, name);
    }

    private String currentValue(XMLStreamReader xmlStreamReader, String name) throws XMLStreamException {
        xmlStreamReader.require(XMLStreamConstants.START_ELEMENT, null, name);
        String value = "";
        for (;;) {
            int next = xmlStreamReader.next();
            if (next==XMLStreamConstants.CDATA||next==XMLStreamConstants.SPACE||next==XMLStreamConstants.CHARACTERS)
                value += xmlStreamReader.getText();
            else if (next==XMLStreamConstants.END_ELEMENT)
                break;
            // ignore comments, PIs, attributes
        }
        xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, name);
        return value.trim();
    }    
}
Erik answered 27/7, 2010 at 18:25 Comment(1)
Thank you for your response. I'm looking for a more generic approach, meaning it should work for any number of nodes at any depth; as in the example XML, one item object can have more nodes than the next one, so that case should be handled too. It can also happen that nodes have the same name but different values and attributes, which calls for a new column in the CSV as well.Duologue

I am not convinced that SAX is the best approach for you. There are different ways you could use SAX here, though.

If element order is not guaranteed within certain elements, like ListingDetails, then you need to be proactive.

When you start a ListingDetails, initialize a map as a member variable on the handler. In each subelement, set the appropriate key/value in that map. When you finish a ListingDetails, examine the map and explicitly fill in values, such as nulls, for the missing elements. Assuming you have one ListingDetails per item, save it to a member variable in the handler.

Now, when your item element ends, have a function that writes the CSV line based on the map, in the order you wanted.

The risk with this is if you have corrupted XML. I would strongly consider setting all these variables to null when an item starts, and then checking for errors and announcing them when the item ends.
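
A minimal sketch of the handler structure described above (this is my illustration, not the poster's code; the class name and the fixed column list are assumptions):

import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ItemCsvHandler extends DefaultHandler {

    // Fixed output order; missing elements simply become empty columns.
    private static final String[] COLUMNS =
            {"ItemID", "StartTime", "EndTime", "ViewItemURL", "AverageTime", "AveragePrice"};

    private Map<String, String> current;   // values collected for the current <Item>
    private StringBuilder text;            // text of the element currently being read

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("Item".equalsIgnoreCase(qName)) {
            current = new LinkedHashMap<>();   // reset state at the start of each item
        }
        text = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);    // may be called several times per text node
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;
        }
        if ("Item".equalsIgnoreCase(qName)) {
            // Item finished: emit one CSV line, using an empty string for anything missing.
            StringBuilder line = new StringBuilder();
            for (String col : COLUMNS) {
                if (line.length() > 0) {
                    line.append(',');
                }
                line.append(current.getOrDefault(col, ""));
            }
            System.out.println(line);
            current = null;
        } else if (text != null && text.toString().trim().length() > 0) {
            current.put(qName, text.toString().trim());
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new java.io.File(args[0]), new ItemCsvHandler());
    }
}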

Undesigned answered 20/7, 2010 at 19:6 Comment(0)

Note that this would be a prime example for using XSLT, except that most XSLT processors read the whole XML file into memory, which is not an option here since the file is large. Note, however, that the enterprise version of Saxon can do streaming XSLT processing (if the XSLT script adheres to the restrictions).

You may also want to use an external XSLT processor outside your JVM instead, if applicable. This opens up several more options.

Streaming in Saxon-EE: http://www.saxonica.com/documentation/sourcedocs/serial.html

Ideal answered 1/8, 2010 at 13:0 Comment(2)
There is also Joost/STX (joost.sourceforge.net), which is an XSLT-like language with some additional constraints for streaming. Since this problem only requires sequential processing of the input, it should fit well into that model.Invagination
Why just XSLT-like instead of an XSLT subset?Specimen

You could use XStream (http://x-stream.github.io/) or JOX (http://www.wutka.com/jox.html) to read the XML and convert it to Java beans. I think you can convert the beans to CSV automatically once you have them.
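
To illustrate the XStream side only (a rough sketch, not the poster's code; the bean and field names are assumptions, and for a 1 GB file you would still want to feed XStream one <Item> fragment at a time rather than the whole document):

import com.thoughtworks.xstream.XStream;

public class XStreamItemSketch {

    public static class ListingDetails {
        String startTime;
        String endTime;
        String viewItemURL;
    }

    public static class Item {
        String itemId;
        ListingDetails listingDetails;
    }

    public static void main(String[] args) {
        XStream xstream = new XStream();
        // Recent XStream versions require explicit type permissions.
        xstream.allowTypes(new Class[]{Item.class, ListingDetails.class});
        // Tolerate elements we have not mapped (e.g. the optional <averages> block).
        xstream.ignoreUnknownElements();
        xstream.alias("Item", Item.class);
        xstream.aliasField("ItemID", Item.class, "itemId");
        xstream.aliasField("ListingDetails", Item.class, "listingDetails");
        xstream.aliasField("StartTime", ListingDetails.class, "startTime");
        xstream.aliasField("EndTime", ListingDetails.class, "endTime");
        xstream.aliasField("ViewItemURL", ListingDetails.class, "viewItemURL");

        String fragment = "<Item><ItemID>4504216603</ItemID><ListingDetails>"
                + "<StartTime>10:00:10.000Z</StartTime><EndTime>10:00:30.000Z</EndTime>"
                + "<ViewItemURL>http://url</ViewItemURL></ListingDetails></Item>";

        Item item = (Item) xstream.fromXML(fragment);
        System.out.println(item.itemId + "," + item.listingDetails.startTime + ","
                + item.listingDetails.endTime + "," + item.listingDetails.viewItemURL);
    }
}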

Vue answered 20/7, 2010 at 19:13 Comment(0)
