How to Parse Big (50 GB) XML Files in Java
Asked Answered
C

2

26

Currently im trying to use a SAX Parser but about 3/4 through the file it just completely freezes up, i have tried allocating more memory etc but not getting any improvements.

Is there any way to speed this up? A better method?

Stripped it to bare bones, so i now have the following code and when running in command line it still doesn't go as fast as i would like.

Running it with "java -Xms-4096m -Xmx8192m -jar reader.jar" i get a GC overhead limit exceeded around article 700000

Main:

public class Read {
    public static void main(String[] args) {       
       pages = XMLManager.getPages();
    }
}

XMLManager

public class XMLManager {
    public static ArrayList<Page> getPages() {

    ArrayList<Page> pages = null; 
    SAXParserFactory factory = SAXParserFactory.newInstance();

    try {

        SAXParser parser = factory.newSAXParser();
        File file = new File("..\\enwiki-20140811-pages-articles.xml");
        PageHandler pageHandler = new PageHandler();

        parser.parse(file, pageHandler);
        pages = pageHandler.getPages();

    } catch (ParserConfigurationException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }


    return pages;
    }    
}

PageHandler

public class PageHandler extends DefaultHandler{

    private ArrayList<Page> pages = new ArrayList<>();
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(){
        super();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {

        stringBuilder = new StringBuilder();

         if (qName.equals("page")){

            page = new Page();
            idSet = false;

        } else if (qName.equals("redirect")){
             if (page != null){
                 page.setRedirecting(true);
             }
        }
    }

     @Override
     public void endElement(String uri, String localName, String qName) throws SAXException {

         if (page != null && !page.isRedirecting()){

             if (qName.equals("title")){

                 page.setTitle(stringBuilder.toString());

             } else if (qName.equals("id")){

                 if (!idSet){

                     page.setId(Integer.parseInt(stringBuilder.toString()));
                     idSet = true;

                 }

             } else if (qName.equals("text")){

                 String articleText = stringBuilder.toString();

                 articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>", " "); //remove references
                 articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}", " "); //remove links underneath headings
                 articleText = articleText.replaceAll("(?s)==See also==.+", " "); //remove everything after see also
                 articleText = articleText.replaceAll("\\|", " "); //Separate multiple links
                 articleText = articleText.replaceAll("\\n", " "); //remove new lines
                 articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]", " "); //remove all non alphanumeric except dashes and spaces
                 articleText = articleText.trim().replaceAll(" +", " "); //convert all multiple spaces to 1 space

                 Pattern pattern = Pattern.compile("([\\S]+\\s*){1,75}"); //get first 75 words of text
                 Matcher matcher = pattern.matcher(articleText);
                 matcher.find();

                 try {
                     page.setSummaryText(matcher.group());
                 } catch (IllegalStateException se){
                     page.setSummaryText("None");
                 }
                 page.setText(articleText);

             } else if (qName.equals("page")){

                 pages.add(page);
                 page = null;

            }
        } else {
            page = null;
        }
     }

     @Override
     public void characters(char[] ch, int start, int length) throws SAXException {
         stringBuilder.append(ch,start, length); 
     }

     public ArrayList<Page> getPages() {
         return pages;
     }
}
Coblenz answered 11/10, 2014 at 2:57 Comment(9)
Are you sure that what's "freezing up" (want to give us any more details about what that means for your situation?) is the SAX parser rather than something in your code? Are you keeping objects in memory anywhere in your application?Bergmans
Im just running some tests on it at the moment, but i have a feeling it may have been eclipse that was freezing up (Stripoped it to bare bones and it sitll froze up). Running it through commandline at the moment, keep you posted.Coblenz
Added some basic code that just outputs what article the reader is up to within the xml fileCoblenz
Clear the StringBuilder at the end of the endElement() routine. You actually need a stack of string builders to handle nested elements properly.Colon
Isn't stringBuilder = new StringBuilder(); in startElement "clearing" it?Coblenz
Yes but in the wrong place and time. In any case you need to implement the stack. Push it in startElement(), pop it in endElement(), and append to the one at the top of the stack in characters().Colon
Sorry i'm not to sure what you mean by a stack of string builders?Coblenz
So what are you doing with the content of stringBuilder? I don't see you use it anywhere, and I'd be concerned that you're keeping it in memory (which could easily cause an OOM with a 50GB file). Can you post that code?Bergmans
So I've posted the full code, Main is a lot more complex that what is displayed it but havn't included it as its not what is causing the error. As you can see with the page handler its creating a big array list of all the pages/articles it finds inside the xml file. Im fairly certain that that arraylist is what is causing the GC overhead limit error at the moment but i'm not sure how i could avoid it?Coblenz
M
33

Your parsing code is likely working fine, but the volume of data you're loading is probably just too large to hold in memory in that ArrayList.

You need some sort of pipeline to pass the data on to its actual destination without ever store it all in memory at once.

What I've sometimes done for this sort of situation is similar to the following.

Create an interface for processing a single element:

public interface PageProcessor {
    void process(Page page);
}

Supply an implementation of this to the PageHandler through a constructor:

public class Read  {
    public static void main(String[] args) {

        XMLManager.load(new PageProcessor() {
            @Override
            public void process(Page page) {
                // Obviously you want to do something other than just printing, 
                // but I don't know what that is...
                System.out.println(page);
           }
        }) ;
    }

}


public class XMLManager {

    public static void load(PageProcessor processor) {
        SAXParserFactory factory = SAXParserFactory.newInstance();

        try {

            SAXParser parser = factory.newSAXParser();
            File file = new File("pages-articles.xml");
            PageHandler pageHandler = new PageHandler(processor);

            parser.parse(file, pageHandler);

        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}

Send data to this processor instead of putting it in the list:

public class PageHandler extends DefaultHandler {

    private final PageProcessor processor;
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(PageProcessor processor) {
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
         //Unchanged from your implementation
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
         //Unchanged from your implementation
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
            //  Elide code not needing change

            } else if (qName.equals("page")){

                processor.process(page);
                page = null;

            }
        } else {
            page = null;
        }
    }

}

Of course, you can make your interface handle chunks of multiple records rather than just one and have the PageHandler collect pages locally in a smaller list and periodically send the list off for processing and clear the list.

Or (perhaps better) you could implement the PageProcessor interface as defined here and build in logic there that buffers the data and sends it on for further handling in chunks.

Metallo answered 19/10, 2014 at 16:2 Comment(2)
This is what i have ended up doing, 10,000 pages at a time works well.Coblenz
I've just started looking at a similar task. Is that a custom Page class?Nuli
A
0

Don Roby's approach is somewhat reminiscent to the approach I followed creating a code generator designed to solve this particular problem (an early version was conceived in 2008). Basically each complexType has its Java POJO equivalent and handlers for the particular type are activated when the context changes to that element. I used this approach for SEPA, transaction banking and for instance discogs (30GB). You can specify what elements you want to process at runtime, declaratively using a propeties file.

XML2J uses mapping of complexTypes to Java POJOs on the one hand, but lets you specify events you want to listen on. E.g.

account/@process = true
account/accounts/@process = true
account/accounts/@detach = true

The essence is in the third line. The detach makes sure individual accounts are not added to the accounts list. So it won't overflow.

class AccountType {
    private List<AccountType> accounts = new ArrayList<>();

    public void addAccount(AccountType tAccount) {
        accounts.add(tAccount);
    }
    // etc.
};

In your code you need to implement the process method (by default the code generator generates an empty method:

class AccountsProcessor implements MessageProcessor {
    static private Logger logger = LoggerFactory.getLogger(AccountsProcessor.class);

    // assuming Spring data persistency here
    final String path = new ClassPathResource("spring-config.xml").getPath();
    ClassPathXmlApplicationContext context = new   ClassPathXmlApplicationContext(path);
    AccountsTypeRepo repo = context.getBean(AccountsTypeRepo.class);


    @Override
    public void process(XMLEvent evt, ComplexDataType data)
        throws ProcessorException {

        if (evt == XMLEvent.END) {
            if( data instanceof AccountType) {
                process((AccountType)data);
            }
        }
    }

    private void process(AccountType data) {
        if (logger.isInfoEnabled()) {
            // do some logging
        }
        repo.save(data);
    }
}   

Note that XMLEvent.END marks the closing tag of an element. So, when you are processing it, it is complete. If you have to relate it (using a FK) to its parent object in the database, you could process the XMLEvent.BEGIN for the parent, create a placeholder in the database and use its key to store with each of its children. In the final XMLEvent.END you would then update the parent.

Note that the code generator generates everything you need. You just have to implement that method and of course the DB glue code.

There are samples to get you started. The code generator even generates your POM files, so you can immediately after generation build your project.

The default process method is like this:

@Override
public void process(XMLEvent evt, ComplexDataType data)
    throws ProcessorException {


/*
 *  TODO Auto-generated method stub implement your own handling here.
 *  Use the runtime configuration file to determine which events are to be sent to the processor.
 */ 

    if (evt == XMLEvent.END) {
        data.print( ConsoleWriter.out );
    }
}

Downloads:

First mvn clean install the core (it has to be in the local maven repo), then the generator. And don't forget to set up the environment variable XML2J_HOME as per directions in the usermanual.

Abscise answered 27/1, 2016 at 15:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.