Parsing XML in Java using SAX: value cut in two halves

I am trying to read an XML-based file format called mzXML using SAX in Java. It carries partially encoded mass spectrometric data (signals with intensities).

This is what the entry of interest looks like (there is more information around that):

    <peaks ... >eJwBgAN//EByACzkZJkHP/NlAceAXLJAckeQ4CIUJz/203q2...</peaks>

A complete file that forces the Error in my case can be downloaded here.

The string in one of these entries holds about 500 compressed and base64-encoded pairs of doubles (signals and intensities). I decompress and decode it to get the values (decoding is not shown in the example below). That all works fine on a small dataset. Now I used a bigger one and ran into a problem that I don't understand:

The method characters(ch, start, length) does not read the complete entry in the line shown above. The length value seems to be too small.

I did not notice the problem when I just printed the peaks entry to the console, because the entry is long and I didn't see that characters were missing. But decompression fails when data is missing. When I run the program repeatedly, it always breaks the same line at the same point without throwing any exception. If I change the mzXML file, e.g. by deleting a scan, it breaks at a different position. I found this out by setting breakpoints in the characters() method and inspecting the content of currentValue.

Here is the code needed to reproduce the problem:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

import javax.xml.bind.DatatypeConverter;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ReadXMLFile {

    public static byte[] decompress(byte[] data) throws IOException, DataFormatException { 
        Inflater inflater = new Inflater();  
        inflater.setInput(data); 

        ByteArrayOutputStream outputStream = new ByteArrayOutputStream(data.length); 
        byte[] buffer = new byte[data.length*2]; 
        while (!inflater.finished()) { 
            int count = inflater.inflate(buffer); 
            outputStream.write(buffer, 0, count); 
        } 
        outputStream.close(); 
        byte[] output = outputStream.toByteArray(); 

        return output; 
    } 

    public static void main(String args[]) {

        try {

            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();

            DefaultHandler handler = new DefaultHandler() {

                boolean peaks = false;

                public void startElement(String uri, String localName,String qName, 
                        Attributes attributes) throws SAXException {

                    if (qName.equalsIgnoreCase("PEAKS")) {
                        peaks = true;
                    }
                }

                public void endElement(String uri, String localName,
                        String qName) throws SAXException {
                    if (peaks) {peaks = false;}
                }

                public void characters(char ch[], int start, int length) throws SAXException {

                    if (peaks) {
                        String currentValue = new String(ch, start, length);
                        System.out.println(currentValue);
                        try {
                            byte[] array = decompress(DatatypeConverter.parseBase64Binary(currentValue));
                            System.out.println(array[1]);

                        } catch (IOException | DataFormatException e) {e.printStackTrace();}
                        peaks = false;
                    }
                }
            };

            saxParser.parse("file1_zlib.mzxml", handler);

        } catch (Exception e) {e.printStackTrace();}
    }

}

Is there a safer way to read large XML files? Can you tell me where the error comes from or how to avoid it?

Thanks, Michael

Longoria answered 5/11, 2013 at 13:27

The method characters(ch, start, length) does not read the complete entry in the line shown above. The length value seems to be too small.

That is precisely the way it is designed to work. From the documentation of ContentHandler:

SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks.

Therefore, you should not try calling decompress inside the characters implementation. Instead, you should append the characters that you get to an expandable buffer, and call decompress only when you get the corresponding endElement:

StringBuilder sb = null;

public void startElement(String uri, String localName,String qName, 
    Attributes attributes) throws SAXException {
    if (qName.equalsIgnoreCase("PEAKS")) {
        sb = new StringBuilder();
    }
}

public void endElement(String uri, String localName, String qName) throws SAXException {
    if (sb == null) return;
    try {
        byte[] array = decompress(DatatypeConverter.parseBase64Binary(sb.toString()));
        System.out.println(array[1]);
    } catch (IOException | DataFormatException e) {e.printStackTrace();}
    sb = null;
}

public void characters(char ch[], int start, int length) throws SAXException {
    if (sb == null) return;
    // May be called several times for one element, so only accumulate here
    sb.append(ch, start, length);
}
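
For reference, here is a self-contained sketch of the same idea that can be run without the mzXML file. The class name PeaksReader and the method readPeaks are made up for this illustration. It also swaps in java.util.Base64 for DatatypeConverter (an assumption that Java 8 or later is available) and makes decompress fail on truncated input rather than loop, so treat it as one possible arrangement, not the definitive one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for this example only
class PeaksReader {

    /** Inflates zlib data; throws instead of looping forever on truncated input. */
    static byte[] decompress(byte[] data) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int count = inflater.inflate(buffer);
                if (count == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("Truncated or unsupported zlib stream");
                }
                out.write(buffer, 0, count);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Parses the XML and returns the decoded payload of every peaks element. */
    static List<byte[]> readPeaks(byte[] xml) throws Exception {
        List<byte[]> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a peaks element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equalsIgnoreCase("peaks")) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver one element's text in several chunks
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if (text == null || !qName.equalsIgnoreCase("peaks")) return;
                try {
                    // The MIME decoder skips line breaks inside the base64 text
                    byte[] compressed = Base64.getMimeDecoder().decode(text.toString());
                    result.add(decompress(compressed));
                } catch (DataFormatException | IllegalArgumentException e) {
                    throw new SAXException("Could not decode peaks content", e);
                } finally {
                    text = null;
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml), handler);
        return result;
    }
}
```

Decoding happens only in endElement, once the whole text has been collected, so it does not matter how many times the parser calls characters for one element.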
Schott answered 5/11, 2013 at 13:33

Try this: use a LinkedList to store the tag names, adding one at every startElement() and removing the last one with pollLast() at every endElement(). Use String.trim() on the data from characters(). Every time characters() delivers some actual data (check String.length() != 0), you can associate it with the last element (peekLast()) in the LinkedList.

Then you can choose to append() it, or handle it some other way.
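
A rough sketch of that tag-stack idea, under a few assumptions: the class name TagStackDemo and the method textByElement are invented for this example, and the text is appended per open element and trimmed only in endElement(), since trimming each chunk separately could drop whitespace that belongs in the middle of a value.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, named for this example only
class TagStackDemo {

    /** Returns the trimmed text of each element, keyed by tag name (last occurrence wins). */
    static Map<String, String> textByElement(String xml) throws Exception {
        Map<String, String> result = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private final Deque<String> openTags = new LinkedList<>();
            private final Deque<StringBuilder> buffers = new LinkedList<>();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                openTags.addLast(qName);
                buffers.addLast(new StringBuilder());
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Append to the innermost open element; chunks may arrive in pieces
                if (!buffers.isEmpty()) {
                    buffers.peekLast().append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String tag = openTags.pollLast();
                String text = buffers.pollLast().toString().trim();
                if (!text.isEmpty()) {
                    result.put(tag, text);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return result;
    }
}
```

Calling TagStackDemo.textByElement on a small document gives one entry per element that has direct text, which can then be decoded or appended as needed.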

Teletypesetter answered 28/1, 2014 at 12:33