Can JAXB parse large XML files in chunks?

I need to parse potentially large XML files whose schema is already provided to me as several XSD files, so XML binding is strongly preferred. I'd like to know whether I can use JAXB to parse a file in chunks and, if so, how.

Ulland answered 15/7, 2009 at 21:26 Comment(0)
A
34

Because code matters, here is a PartialUnmarshaller that reads a big file in chunks. It can be used like this: new PartialUnmarshaller<YourClass>(stream, YourClass.class)

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.*;
import java.io.InputStream;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import static javax.xml.stream.XMLStreamConstants.*;

public class PartialUnmarshaller<T> {
    XMLStreamReader reader;
    Class<T> clazz;
    Unmarshaller unmarshaller;

    public PartialUnmarshaller(InputStream stream, Class<T> clazz) throws XMLStreamException, FactoryConfigurationError, JAXBException {
        this.clazz = clazz;
        this.unmarshaller = JAXBContext.newInstance(clazz).createUnmarshaller();
        this.reader = XMLInputFactory.newInstance().createXMLStreamReader(stream);

        /* ignore headers */
        skipElements(START_DOCUMENT, DTD);
        /* ignore root element */
        reader.nextTag();
        /* if there's no tag, ignore root element's end */
        skipElements(END_ELEMENT);
    }

    public T next() throws XMLStreamException, JAXBException {
        if (!hasNext())
            throw new NoSuchElementException();

        T value = unmarshaller.unmarshal(reader, clazz).getValue();

        skipElements(CHARACTERS, END_ELEMENT);
        return value;
    }

    public boolean hasNext() throws XMLStreamException {
        return reader.hasNext();
    }

    public void close() throws XMLStreamException {
        reader.close();
    }

    void skipElements(int... elements) throws XMLStreamException {
        int eventType = reader.getEventType();

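        // box the int event codes so that contains() can be used (no Guava needed)
        List<Integer> types = IntStream.of(elements).boxed().collect(Collectors.toList());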
        while (types.contains(eventType))
            eventType = reader.next();
    }
}
Aftershock answered 13/2, 2012 at 11:53 Comment(4)
I need to use XMLStreamConstants.START_DOCUMENT and so on for this to work. - Clerkly
@MatthiasWuttke You can add them as a static import: import static javax.xml.stream.XMLStreamConstants.*; - Sulfapyridine
You may also need Guava's Ints.asList or, in Java 8, IntStream.of(elements).boxed().collect(Collectors.toList()); - Sulfapyridine
Hi, thanks for this response. I am doing something similar but am unable to marshal a few events to my class. If you get a chance, could you please have a look at this question and share your observations? #67668016 - Transsonic
W
20

This is detailed in the user guide. The JAXB download from http://jaxb.java.net/ includes an example of how to parse one chunk at a time.

When a document is large, it's usually because there are repetitive parts in it. Perhaps it's a purchase order with a large list of line items, or perhaps it's an XML log file with a large number of log entries.

This kind of XML is suitable for chunk processing; the main idea is to use the StAX API, run a loop, and unmarshal individual chunks separately. Your program acts on a single chunk and then throws it away. This way, you keep at most one chunk in memory, which allows you to process large documents.

See the streaming-unmarshalling example and the partial-unmarshalling example in the JAXB RI distribution for more about how to do this. The streaming-unmarshalling example has the advantage that it can handle chunks at an arbitrary nesting level, but it requires you to deal with the push model: the JAXB unmarshaller will "push" each new chunk to you, and you'll need to process it right there.
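To make the push model concrete, here is a minimal sketch that uses only the JDK's SAX parser and no JAXB. PushChunkHandler, its parse helper, and the "item" element name are hypothetical names chosen for illustration; this is not the code of the JAXB RI sample. The parser drives the handler, and the handler hands each finished chunk to a callback wherever the chunk sits in the tree.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;
import java.util.function.Consumer;

// Hypothetical handler: pushes the text of every element named chunkName to a sink.
final class PushChunkHandler extends DefaultHandler {
    private final String chunkName;
    private final Consumer<String> sink;
    private StringBuilder text; // non-null only while the parser is inside a chunk

    PushChunkHandler(String chunkName, Consumer<String> sink) {
        this.chunkName = chunkName;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (chunkName.equals(localName)) {
            text = new StringBuilder(); // a new chunk begins, at any nesting depth
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (text != null && chunkName.equals(localName)) {
            sink.accept(text.toString()); // the parser "pushes" the chunk to the caller
            text = null;                  // nothing is retained after the callback returns
        }
    }

    // Convenience wrapper; checked exceptions are wrapped to keep the sketch short.
    static void parse(String xml, String chunkName, Consumer<String> sink) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so that localName is populated
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new PushChunkHandler(chunkName, sink));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><group><item>a</item></group><item>b</item></root>";
        parse(xml, "item", chunk -> System.out.println("got chunk: " + chunk));
    }
}
```

The sketch assumes chunks are not nested inside other chunks of the same name. In a JAXB setup, the callback would receive an unmarshalled object rather than a string.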

In contrast, the partial-unmarshalling example works in a pull model (which usually makes the processing easier), but this approach is limited when it comes to data-binding the portions of the document other than the repeated part.
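The pull model can be sketched with StAX alone. The example below is an illustration under assumptions, not the JAXB RI sample: LogEntry, PullChunkReader, and the log/entry document layout are hypothetical, and the entry object is built by hand where a JAXB program would call unmarshaller.unmarshal(reader, clazz) on the same reader.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.function.Consumer;

// Hypothetical chunk type; with JAXB this would be a generated, annotated class.
final class LogEntry {
    final String level;
    final String message;

    LogEntry(String level, String message) {
        this.level = level;
        this.message = message;
    }
}

final class PullChunkReader {
    // Walks <log><entry level="..">text</entry>...</log>, hands each entry to the sink,
    // and returns the number of entries seen.
    static int readEntries(String xml, Consumer<LogEntry> sink) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                reader.nextTag(); // step onto the root element
                int count = 0;
                // nextTag() skips whitespace and stops on the next start or end tag,
                // so the loop ends when it reaches the root's end tag.
                while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
                    String level = reader.getAttributeValue(null, "level");
                    String message = reader.getElementText(); // leaves the reader on </entry>
                    sink.accept(new LogEntry(level, message)); // caller processes, then drops it
                    count++;
                }
                return count;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<log><entry level=\"INFO\">started</entry>"
                + "<entry level=\"WARN\">disk low</entry></log>";
        readEntries(xml, e -> System.out.println(e.level + ": " + e.message));
    }
}
```

Here the calling code owns the loop and asks for the next chunk when it is ready, which is what makes the pull style easier to reason about. Only one LogEntry is reachable at a time unless the sink chooses to keep it.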

Warehouseman answered 15/7, 2009 at 21:29 Comment(3)
Right, that's one of the sites I found when researching this, but I was unable to find the "streaming-unmarshalling" and "partial-unmarshalling" examples it referred to in section 4.4.1. - Ulland
Odd. Where are you looking? I just downloaded the JAR from jaxb.dev.java.net/2.1.12, unpacked it, and there under "samples" is "partial-unmarshalling" and "stream-unmarshalling". - Warehouseman
You can find the examples on GitHub: github.com/javaee/jaxb-v2/tree/master/jaxb-ri/samples/src/main/… and github.com/javaee/jaxb-v2/tree/master/jaxb-ri/samples/src/main/… - Diablerie
W
3

Yves Amsellem's answer is pretty good, but it only works if all elements are of exactly the same type. Otherwise the unmarshal call will throw an exception, and because the reader will already have consumed the bytes, you will be unable to recover. Instead, we should follow Skaffman's advice and look at the sample from the JAXB jar.
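For context, one way to cope with mixed element types while staying in the pull model is to inspect the element name before anything is consumed and choose a parser per name. This is a different technique from the JAXB sample discussed in this answer, and the sketch below uses StAX only; TypeDispatchReader, ChunkParser, and the element names are hypothetical. In a JAXB program, each ChunkParser could wrap an unmarshal call for the matching class.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

final class TypeDispatchReader {
    // Contract (assumption of this sketch): called on START_ELEMENT,
    // must return with the reader on the matching END_ELEMENT.
    interface ChunkParser {
        Object parse(XMLStreamReader reader) throws XMLStreamException;
    }

    static void readAll(String xml, Map<String, ChunkParser> parsers, Consumer<Object> sink) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                reader.nextTag(); // step onto the container element
                while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
                    // Decide by name while still on the start tag, before consuming anything.
                    ChunkParser parser = parsers.get(reader.getLocalName());
                    if (parser != null) {
                        sink.accept(parser.parse(reader));
                    } else {
                        skipSubtree(reader); // unknown type: skip it instead of failing
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
    }

    // Assumes the reader is on START_ELEMENT; leaves it on the matching END_ELEMENT.
    static void skipSubtree(XMLStreamReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, ChunkParser> parsers = new HashMap<>();
        parsers.put("type1", r -> "type1 chunk: " + r.getElementText());
        parsers.put("type2", r -> Integer.valueOf(r.getElementText()));
        String xml = "<container><type1>a</type1><other><x>skipped</x></other>"
                + "<type2>7</type2></container>";
        readAll(xml, parsers, System.out::println);
    }
}
```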

To explain how it works:

  1. Create a JAXB unmarshaller.
  2. Add a listener to the unmarshaller for intercepting the appropriate elements. This is done by "hacking" the ArrayList to ensure the elements are not stored in memory after being unmarshalled.
  3. Create a SAX parser. This is where the streaming happens.
  4. Use the unmarshaller to generate a handler for the SAX parser.
  5. Stream!

I modified the solution to be generic*. However, it required some reflection. If this is not OK, please look at the code samples in the JAXB jars.

ArrayListAddInterceptor.java

import java.lang.reflect.Field;
import java.util.ArrayList;

public class ArrayListAddInterceptor<T> extends ArrayList<T> {
    private static final long serialVersionUID = 1L;

    private AddInterceptor<T> interceptor;

    public ArrayListAddInterceptor(AddInterceptor<T> interceptor) {
        this.interceptor = interceptor;
    }

    @Override
    public boolean add(T t) {
        interceptor.intercept(t);
        return false;
    }

    public static interface AddInterceptor<T> {
        public void intercept(T t);
    }

    public static void apply(AddInterceptor<?> interceptor, Object o, String property) {
        try {
            Field field = o.getClass().getDeclaredField(property);
            field.setAccessible(true);
            field.set(o, new ArrayListAddInterceptor(interceptor));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

}
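The effect of overriding add can be checked in isolation. The following sketch is a stripped-down, hypothetical variant (ForwardingList and ForwardingListDemo are names made up here, and java.util.function.Consumer replaces the AddInterceptor interface); it shows that every added element reaches the callback while the list itself stays empty, which is what keeps memory use flat.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical simplified variant of the interceptor list: forwards each element
// to a callback and never stores it.
final class ForwardingList<T> extends ArrayList<T> {
    private static final long serialVersionUID = 1L;

    private final Consumer<T> sink;

    ForwardingList(Consumer<T> sink) {
        this.sink = sink;
    }

    @Override
    public boolean add(T item) {
        sink.accept(item); // process the element right away
        return false;      // the list was not modified, so report false
    }
}

final class ForwardingListDemo {
    // Returns {number of elements forwarded, size of the list afterwards}.
    static int[] run(String... items) {
        int[] forwarded = {0};
        List<String> list = new ForwardingList<>(item -> forwarded[0]++);
        for (String item : items) {
            list.add(item);
        }
        return new int[] {forwarded[0], list.size()};
    }

    public static void main(String[] args) {
        int[] result = run("order-1", "order-2", "order-3");
        System.out.println("forwarded=" + result[0] + ", retained=" + result[1]);
    }
}
```

Only add(T) is overridden here, as in the class above, so code that fills the list through another method such as addAll or add(int, T) would bypass the callback.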

Main.java

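import java.io.File;
import java.io.FileInputStream;
import java.util.List;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Assumption: PurchaseOrder and PurchaseOrders are generated from primer.xsd into package "primer"
import primer.PurchaseOrder;
import primer.PurchaseOrders;

public class Main {
    public void parsePurchaseOrders(ArrayListAddInterceptor.AddInterceptor<PurchaseOrder> interceptor, List<File> files) {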
        try {
            // create JAXBContext for the primer.xsd
            JAXBContext context = JAXBContext.newInstance("primer");

            Unmarshaller unmarshaller = context.createUnmarshaller();

            // install the callback on all PurchaseOrders instances
            unmarshaller.setListener(new Unmarshaller.Listener() {
                public void beforeUnmarshal(Object target, Object parent) {
                    if (target instanceof PurchaseOrders) {
                        ArrayListAddInterceptor.apply(interceptor, target, "purchaseOrder");
                    }
                }
            });

            // create a new XML parser
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(unmarshaller.getUnmarshallerHandler());

            for (File file : files) {
                // close each stream once its file has been parsed
                try (FileInputStream in = new FileInputStream(file)) {
                    reader.parse(new InputSource(in));
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

*This code has not been tested and is for illustrative purposes only.

Wheelchair answered 10/10, 2015 at 17:14 Comment(0)
S
2

I wrote a small library (available on Maven Central) to help read big XML files and process them in chunks. Please note that it can only be applied to files with a single container holding a list of data elements (possibly of different types). In other words, your file has to follow this structure:

<container>
   <type1>...</type1>
   <type2>...</type2>
   <type1>...</type1>
   ...
</container>

Here is an example where Type1, Type2, ... are the JAXB representations of the repeating data in the file:

try (StreamingUnmarshaller unmarshaller = new StreamingUnmarshaller(Type1.class, Type2.class, ...)) {
    unmarshaller.open(new FileInputStream(fileName));
    unmarshaller.iterate((type, element) -> doWhatYouWant(element));
}

You can find more information with detailed examples on the GitHub page of the library.

Springwood answered 28/9, 2020 at 0:15 Comment(0)
