I am trying to read mzXML, an XML-based file format, using SAX in Java. It carries partially encoded mass spectrometric data (signals with intensities).
This is what the entry of interest looks like (there is more information around that):
<peaks ... >eJwBgAN//EByACzkZJkHP/NlAceAXLJAckeQ4CIUJz/203q2...</peaks>
A complete file that triggers the error in my case can be downloaded here.
The string in each of these entries holds about 500 compressed, base64-encoded pairs of doubles (signals and intensities). I decompress and decode it to get the values (decoding not shown in the example below). That all works fine on a small dataset. Now I used a bigger one and ran into a problem that I don't understand:
The method characters(ch, start, length) does not read the complete entry in the line shown above. The length value seems to be too small.
I did not see the problem when I just printed the peaks entry to the console: there are a lot of characters and I didn't notice that some were missing. But the decompression fails when information is missing. When I run the program repeatedly, it always breaks the same line at the same point without throwing any exception. If I change the mzXML file, e.g. by deleting a scan, it breaks at a different position. I found this out using breakpoints in the characters() method, looking at the content of currentValue.
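Since a shortened entry produces no exception, it may help to make the decompression step fail loudly when its input ends early. The following is only a sketch: the class name `SafeInflate` is hypothetical, and it assumes the entry is a plain zlib stream without a preset dictionary. It relies on `Inflater.inflate` returning 0 with `needsInput()` true once the input is exhausted before the stream is finished.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper, not part of the original program.
public class SafeInflate {

    /** Inflates a zlib stream and throws if the input stops before the stream ends. */
    public static byte[] decompress(byte[] data) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            ByteArrayOutputStream out =
                    new ByteArrayOutputStream(Math.max(32, data.length * 2));
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int count = inflater.inflate(buffer);
                // No output and no way to continue: the input was cut off
                // (or needs a dictionary, which this sketch does not support).
                if (count == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException(
                            "compressed stream ended early after " + out.size() + " bytes");
                }
                out.write(buffer, 0, count);
            }
            return out.toByteArray();
        } finally {
            inflater.end(); // release native resources
        }
    }
}
```

With a guard like this, a truncated peaks entry should surface as a `DataFormatException` at the point where the data runs out, instead of passing unnoticed.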
Here is the code needed to reproduce the problem:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;
import javax.xml.bind.DatatypeConverter;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ReadXMLFile {

    public static byte[] decompress(byte[] data) throws IOException, DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream(data.length);
        byte[] buffer = new byte[data.length * 2];
        while (!inflater.finished()) {
            int count = inflater.inflate(buffer);
            outputStream.write(buffer, 0, count);
        }
        outputStream.close();
        byte[] output = outputStream.toByteArray();
        return output;
    }

    public static void main(String args[]) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();
            DefaultHandler handler = new DefaultHandler() {
                boolean peaks = false;

                public void startElement(String uri, String localName, String qName,
                        Attributes attributes) throws SAXException {
                    if (qName.equalsIgnoreCase("PEAKS")) {
                        peaks = true;
                    }
                }

                public void endElement(String uri, String localName,
                        String qName) throws SAXException {
                    if (peaks) {
                        peaks = false;
                    }
                }

                public void characters(char ch[], int start, int length) throws SAXException {
                    if (peaks) {
                        String currentValue = new String(ch, start, length);
                        System.out.println(currentValue);
                        try {
                            byte[] array = decompress(DatatypeConverter.parseBase64Binary(currentValue));
                            System.out.println(array[1]);
                        } catch (IOException | DataFormatException e) {
                            e.printStackTrace();
                        }
                        peaks = false;
                    }
                }
            };
            saxParser.parse("file1_zlib.mzxml", handler);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
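For completeness, the decoding step left out of the example could look roughly like the sketch below. It assumes the decompressed bytes are 64-bit big-endian doubles stored as alternating (signal, intensity) values; files written with 32-bit precision would need `getFloat` instead. The class and method names (`PeakDecoder`, `toPairs`) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper illustrating the omitted decoding step.
public class PeakDecoder {

    /** Converts decompressed bytes into rows of {signal, intensity}. */
    public static double[][] toPairs(byte[] raw) {
        // Each pair is two 8-byte doubles, so the length must be a multiple of 16.
        if (raw.length % 16 != 0) {
            throw new IllegalArgumentException(
                    "expected a multiple of 16 bytes, got " + raw.length);
        }
        // Assumption: values are stored in network (big-endian) byte order.
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.BIG_ENDIAN);
        double[][] pairs = new double[raw.length / 16][2];
        for (int i = 0; i < pairs.length; i++) {
            pairs[i][0] = buf.getDouble(); // signal
            pairs[i][1] = buf.getDouble(); // intensity
        }
        return pairs;
    }
}
```

The length check doubles as a cheap sanity test: a byte array that is not a whole number of pairs suggests the entry was cut short somewhere upstream.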
Is there a safer way to read large XML files? Can you tell me where the error comes from or how to avoid it?
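One possibility worth checking: the SAX documentation for `ContentHandler.characters` states that a parser may deliver contiguous character data in a single chunk or split it into several chunks, which would match the symptom described. A minimal sketch of a handler that collects all chunks and only hands over the text once the element closes, assuming that is the cause (the class name `PeaksHandler` and the `getEntries` accessor are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers the text of each <peaks> element.
public class PeaksHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> entries = new ArrayList<>();
    private boolean inPeaks = false;

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        if (qName.equalsIgnoreCase("peaks")) {
            inPeaks = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element; append instead of processing.
        if (inPeaks) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inPeaks && qName.equalsIgnoreCase("peaks")) {
            entries.add(text.toString()); // the complete entry is available here
            inPeaks = false;
        }
    }

    /** Complete text of every peaks element seen so far. */
    public List<String> getEntries() {
        return entries;
    }
}
```

Such a handler could be passed to `saxParser.parse("file1_zlib.mzxml", handler)` in place of the anonymous one, with the base64 decoding and decompression applied to each string from `getEntries()` afterwards (or directly inside `endElement` to avoid holding every entry in memory).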
Thanks, Michael