I'm writing a SAX parser in Java to parse a 2.5GB XML file of Wikipedia articles. Is there a way to monitor the progress of the parsing in Java?
Use a javax.swing.ProgressMonitorInputStream.
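A minimal sketch of how that could be wired to a SAX parse. The class name ProgressMonitorDemo and the helper countElements are made up for illustration, and the handler only counts elements. One caveat, as far as I know: the monitor takes its maximum from the stream's available(), which is an int, so for a file above 2 GB the dialog may report completion early.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.swing.ProgressMonitorInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any answer in this thread.
class ProgressMonitorDemo {

    /** Parses the stream through a ProgressMonitorInputStream and returns the element count. */
    static int countElements(InputStream raw)
            throws IOException, ParserConfigurationException, SAXException {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                count[0]++;
            }
        };
        // A null parent component is allowed; Swing only shows the dialog
        // if the read looks like it will take a while.
        try (InputStream in = new ProgressMonitorInputStream(null, "Parsing XML", raw)) {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        }
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        try (InputStream file = new FileInputStream(args[0])) {
            System.out.println(countElements(file) + " elements");
        }
    }
}
```

If the user presses Cancel in the dialog, the next read throws an InterruptedIOException, which ends the parse.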
Thanks to EJP's suggestion of ProgressMonitorInputStream, in the end I extended FilterInputStream so that a ChangeListener can be used to monitor the current read location in bytes.
This gives you finer control, for example to show multiple progress bars when reading several big XML files in parallel, which is exactly what I did.
So, a simplified version of the monitorable stream:
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import javax.swing.event.ChangeEvent;
import javax.swing.event.ChangeListener;
/**
* A class that monitors the read progress of an input stream.
*
* @author Hermia Yeung "Sheepy"
* @since 2012-04-05 18:42
*/
public class MonitoredInputStream extends FilterInputStream {
private volatile long mark = 0;
private volatile long lastTriggeredLocation = 0;
private volatile long location = 0;
private final int threshold;
private final List<ChangeListener> listeners = new ArrayList<>(4);
/**
* Creates a MonitoredInputStream over an underlying input stream.
* @param in Underlying input stream, should be non-null because of no public setter
* @param threshold Min. position change (in byte) to trigger change event.
*/
public MonitoredInputStream(InputStream in, int threshold) {
super(in);
this.threshold = threshold;
}
/**
* Creates a MonitoredInputStream over an underlying input stream.
* Default threshold is 16KB; a small threshold may hurt performance on larger streams.
* @param in Underlying input stream, should be non-null because of no public setter
*/
public MonitoredInputStream(InputStream in) {
super(in);
this.threshold = 1024*16;
}
public void addChangeListener(ChangeListener l) { if (!listeners.contains(l)) listeners.add(l); }
public void removeChangeListener(ChangeListener l) { listeners.remove(l); }
public long getProgress() { return location; }
protected void triggerChanged( final long location ) {
if ( threshold > 0 && Math.abs( location-lastTriggeredLocation ) < threshold ) return;
lastTriggeredLocation = location;
if (listeners.size() <= 0) return;
try {
final ChangeEvent evt = new ChangeEvent(this);
for (ChangeListener l : listeners) l.stateChanged(evt);
} catch (ConcurrentModificationException e) {
triggerChanged(location); // List changed? Let's re-try.
}
}
@Override public int read() throws IOException {
final int i = super.read();
if ( i != -1 ) triggerChanged( ++location ); // pre-increment so the new position is reported
return i;
}
@Override public int read(byte[] b, int off, int len) throws IOException {
final int i = super.read(b, off, len);
if ( i > 0 ) triggerChanged( location += i );
return i;
}
@Override public long skip(long n) throws IOException {
final long i = super.skip(n);
if ( i > 0 ) triggerChanged( location += i );
return i;
}
@Override public void mark(int readlimit) {
super.mark(readlimit);
mark = location;
}
@Override public void reset() throws IOException {
super.reset();
if ( location != mark ) triggerChanged( location = mark );
}
}
It doesn't know - or care - how big the underlying stream is, so you need to get it some other way, such as from the file itself.
So, here goes the simplified sample usage:
try (
MonitoredInputStream mis = new MonitoredInputStream(new FileInputStream(file), 65536*4)
) {
// Setup max progress and listener to monitor read progress
final long total = file.length(); // keep as long: casting a 2.5GB length to int would overflow
progressBar.setMaxProgress( 100 ); // Swing thread or before display please
mis.addChangeListener( new ChangeListener() { @Override public void stateChanged(ChangeEvent e) {
SwingUtilities.invokeLater( new Runnable() { @Override public void run() {
progressBar.setProgress( (int) (mis.getProgress() * 100 / total) ); // Promise me you WILL use MVC instead of this anonymous class mess!
}});
}});
// Start parsing. Listener would call Swing event thread to do the update.
SAXParserFactory.newInstance().newSAXParser().parse(mis, this);
} catch ( IOException | ParserConfigurationException | SAXException e) {
e.printStackTrace();
} finally {
progressBar.setVisible(false); // Again please call this in swing event thread
}
In my case the progress bars advance smoothly from left to right without abnormal jumps. Adjust the threshold for the best balance between performance and responsiveness: too small and the read time can more than double on small devices; too big and the progress will not be smooth.
Hope it helps. Feel free to edit if you found mistakes or typos, or vote up to send me some encouragements! :D
You can get an estimate of the current line/column in your file by overriding the setDocumentLocator method of org.xml.sax.helpers.DefaultHandler/BaseHandler. This method is called with an object from which you can get an approximation of the current line/column when needed.
Edit: To the best of my knowledge, there is no standard way to get the absolute position. However, I am sure some SAX implementations do offer this kind of information.
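A rough sketch of the locator approach, assuming the parser supplies a Locator (it is allowed not to). The class name LocatorHandler and the lastLine field are illustrative only. To turn the line number into a percentage you would still need the total line count from somewhere else.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that remembers the line of the most recent start tag.
class LocatorHandler extends DefaultHandler {
    private Locator locator;      // supplied by the parser before startDocument
    volatile int lastLine = -1;   // -1 until the first element is seen

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The locator is only an approximation of where the parser currently is.
        if (locator != null) {
            lastLine = locator.getLineNumber();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<wiki>\n<page/>\n<page/>\n</wiki>";
        LocatorHandler handler = new LocatorHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        System.out.println("Last start tag seen near line " + handler.lastLine);
    }
}
```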
Assuming you know how many articles you have, can't you just keep a counter in the handler? E.g.
public void startElement (String uri, String localName,
String qName, Attributes attributes)
throws SAXException {
if(qName.equals("article")){
counter++;
}
...
}
(I don't know whether you are parsing "article", it's just an example)
If you don't know the number of articles in advance, you will need to count them first. Then you can print the status (number of tags read / total number of tags), say every 100 tags (counter % 100 == 0).
Or even have another thread monitor the progress. In this case, you might want to synchronize access to the counter, but that is not strictly necessary, since the count doesn't need to be exact.
My 2 cents
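The counter-plus-monitoring-thread idea could be sketched like this. The class name ArticleCounter is made up, "article" is still just the example tag from above, and an AtomicLong stands in for explicit synchronization. The one-second reporting interval is an arbitrary choice.

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts "article" start tags so another thread can poll the total.
class ArticleCounter extends DefaultHandler {
    private final AtomicLong count = new AtomicLong();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("article".equals(qName)) {
            count.incrementAndGet();
        }
    }

    /** Safe to call from any thread while parsing is in progress. */
    long getCount() {
        return count.get();
    }

    public static void main(String[] args) throws Exception {
        ArticleCounter handler = new ArticleCounter();
        // A background thread prints the running count once per second.
        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
        reporter.scheduleAtFixedRate(
                () -> System.out.println(handler.getCount() + " articles read"),
                1, 1, TimeUnit.SECONDS);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(new File(args[0]), handler);
        } finally {
            reporter.shutdownNow();
        }
        System.out.println("Done: " + handler.getCount() + " articles");
    }
}
```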
I'd use the input stream position. Make your own trivial stream class that delegates/inherits from the "real" one and keeps track of bytes read. As you say, getting the total filesize is easy. I wouldn't worry about buffering, lookahead, etc. - for large files like these it's chickenfeed. On the other hand, I'd limit the position to "99%".
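Such a trivial stream might look like the sketch below. The class name CountingInputStream and the percent() helper are my own naming, not from any library, and the total size is assumed to be passed in by the caller (for example from File.length()).

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper that tracks how many bytes have been consumed.
class CountingInputStream extends FilterInputStream {
    private volatile long bytesRead = 0;  // written by the reading thread only
    private final long totalBytes;        // supplied by the caller, e.g. the file length

    CountingInputStream(InputStream in, long totalBytes) {
        super(in);
        this.totalBytes = totalBytes;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) bytesRead++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) bytesRead += n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        if (skipped > 0) bytesRead += skipped;
        return skipped;
    }

    // mark/reset are not tracked in this sketch, so report them as unsupported.
    @Override
    public boolean markSupported() {
        return false;
    }

    /** Percentage consumed, capped at 99 because the parser reads ahead of what it has processed. */
    int percent() {
        if (totalBytes <= 0) return 0;
        return (int) Math.min(99, bytesRead * 100 / totalBytes); // long arithmetic, no int overflow
    }

    public static void main(String[] args) throws IOException {
        try (CountingInputStream in =
                     new CountingInputStream(new ByteArrayInputStream(new byte[200]), 200)) {
            in.read(new byte[50], 0, 50);
            System.out.println(in.percent() + "% read");
        }
    }
}
```

Doing the arithmetic in long matters here: a 2.5GB length does not fit in an int.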