I'm currently using a not-very-Scala-like approach to parse large Unix mailbox files. I'm still learning the language and would like to challenge myself to find a better way. However, I don't believe I have a solid grasp of just what can be done with an Iterator and how to use it effectively.
I'm currently using org.apache.james.mime4j, and I use the org.apache.james.mime4j.mboxiterator.MboxIterator to get a java.util.Iterator from a file, like so:
// registers an implementation of a ContentHandler that
// allows me to construct an object representing an email
// using callbacks
val handler: ContentHandler = new MyHandler();
// creates a parser that parses a SINGLE email from a given InputStream
val parser: MimeStreamParser = new MimeStreamParser(configBuilder.build());
// register my handler
parser.setContentHandler(handler);
// Get a java.util.Iterator
val iterator = MboxIterator.fromFile(fileName).build();
// For each email, process it using above Handler
iterator.forEach(p => parser.parse(p.asInputStream(Charsets.UTF_8)))
From my understanding, the Scala Iterator is much more robust, and probably a lot more capable of handling something like this, especially because I won't always be able to fit the full file in memory.
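As a point of comparison, a java.util.Iterator can be viewed as a Scala Iterator without copying its contents. The following is a minimal sketch, assuming Scala 2.13 or later (for scala.jdk.CollectionConverters); the object name and the in-memory list standing in for the library's iterator are hypothetical.

```scala
import scala.jdk.CollectionConverters._

object JavaIteratorDemo {
  // Hypothetical stand-in for the java.util.Iterator a mailbox library would return.
  def javaMessages(): java.util.Iterator[String] =
    java.util.Arrays.asList("first message", "second message", "third message").iterator()

  // asScala wraps the Java iterator; elements are still pulled one at a time,
  // and map is applied lazily as each element is requested.
  def upperCased(): Iterator[String] =
    javaMessages().asScala.map(_.toUpperCase)
}
```

Because the wrapper is lazy, nothing is read from the underlying iterator until the Scala side asks for the next element.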
I need to construct my own version of the MboxIterator. I dug through the source for MboxIterator and was able to find a good RegEx pattern for detecting the beginning of individual email messages. However, I'm drawing a blank from there on.
I created the RegEx like so:
val MESSAGE_START = Pattern.compile(FromLinePatterns.DEFAULT, Pattern.MULTILINE);
What I want to do (based on what I know so far):
- Build a FileInputStream from an MBOX file.
- Use Iterator.continually(stream.read()) to read through the stream.
- Use .takeWhile() to continue to read until the end of the stream.
- Chunk the stream using something like MESSAGE_START.matcher(someString).find(), or use it to find the indexes that separate the messages.
- Read the chunks created, or read the bits in between the indexes created.
I feel like I should be able to use map(), find(), filter() and collect() to accomplish this, but I'm getting thrown off by the fact that they only give me Ints to work with.
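The Ints come from InputStream.read(), which returns one byte at a time. One way to keep the same continually/takeWhile shape while working with Strings is to read lines through a BufferedReader instead. This is only a sketch: the object name, the helper name and the sample text are made up for illustration, and it assumes the mailbox can be treated as line-oriented text.

```scala
import java.io.{BufferedReader, Reader, StringReader}

object LineReadingDemo {
  // Lazily yields one line at a time; readLine() returns null at end of input,
  // which is what takeWhile uses as the stop condition.
  def lines(reader: Reader): Iterator[String] = {
    val buffered = new BufferedReader(reader)
    Iterator.continually(buffered.readLine()).takeWhile(_ != null)
  }

  // Hypothetical in-memory stand-in for a reader opened on an mbox file.
  val sample: String = "From alice@example.com\nSubject: hi\n\nbody\n"

  def sampleLines(): List[String] = lines(new StringReader(sample)).toList
}
```

For a real file, the Reader could come from something like java.nio.file.Files.newBufferedReader; only the current line needs to be held in memory at any point.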
How would I accomplish this?
EDIT:
After doing some more thinking on the subject, I thought of another way to describe what I think I need to do:
- I need to keep reading from the stream until I get a string that matches my RegEx
- Maybe group the previously read bytes?
- Send it off to be processed somewhere
- Remove it from the scope somehow so it doesn't get grouped the next time I run into a match
- Continue to read the stream until I find the next match.
- Profit???
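The steps above could be sketched roughly as follows, working on lines rather than raw bytes. The names MessageGrouping, isMessageStart and foreachMessage are hypothetical, and the simple "From " prefix check is an assumption standing in for a real test against the MESSAGE_START pattern.

```scala
object MessageGrouping {
  // Assumption: a line starting with "From " opens a new message.
  // Real code would test the line against MESSAGE_START instead.
  def isMessageStart(line: String): Boolean = line.startsWith("From ")

  // Reads lines one by one, buffering the current message and handing it to
  // `process` as soon as the next message start (or end of input) is reached.
  def foreachMessage(lines: Iterator[String])(process: String => Unit): Unit = {
    val current = new StringBuilder
    for (line <- lines) {
      if (isMessageStart(line) && current.nonEmpty) {
        process(current.toString) // send the finished message off
        current.clear()           // drop it so it is not grouped again
      }
      current.append(line).append('\n')
    }
    // The last message has no following start line, so flush it here.
    if (current.nonEmpty) process(current.toString)
  }
}
```

With this shape, only the message currently being assembled is held in memory. The line that matched is kept as the first line of the next message rather than being discarded.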
EDIT 2:
I think I'm getting closer. Using a method like this gets me an iterator of iterators. However, there are two issues:
1. Is this a waste of memory? Does this mean everything gets read into memory?
2. I still need to figure out a way to split by the match, but still include it in the iterator returned.
def split[T](iter: Iterator[T])(breakOn: T => Boolean): Iterator[Iterator[T]] =
  new Iterator[Iterator[T]] {
    def hasNext = iter.hasNext
    def next = {
      val cur = iter.takeWhile(!breakOn(_))
      iter.dropWhile(breakOn)
      cur
    }
  }.withFilter(l => l.nonEmpty)
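For comparison, here is one possible variant that keeps the matching element at the head of each chunk and only ever calls hasNext, next and head on the underlying iterator. It is a sketch rather than a drop-in replacement: the names SplitDemo, splitBefore and startsChunk are hypothetical, and it returns materialized Vector chunks instead of nested iterators.

```scala
object SplitDemo {
  // Splits `iter` into chunks, starting a new chunk at every element for which
  // `startsChunk` is true. The matching element is kept as the chunk's head.
  def splitBefore[T](iter: Iterator[T])(startsChunk: T => Boolean): Iterator[Vector[T]] = {
    val source = iter.buffered // BufferedIterator lets us peek at the next element via head
    new Iterator[Vector[T]] {
      def hasNext: Boolean = source.hasNext
      def next(): Vector[T] = {
        val chunk = Vector.newBuilder[T]
        chunk += source.next() // always take the first element, match or not
        while (source.hasNext && !startsChunk(source.head)) chunk += source.next()
        chunk.result()
      }
    }
  }
}
```

Each call to next() builds a single chunk, so memory use is bounded by the largest chunk rather than by the whole input. Elements that appear before the first match come out as a chunk of their own.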
MboxIterator should be properly streaming the file content (as opposed to loading it all into memory)... – Kolodgie
split() method might work but it appears to break the first rule of iterators: "one should never use an iterator after calling a method on it. The two most important exceptions are also the sole abstract methods: next and hasNext." (From the Scaladocs page.) – Wiskind