How do I optimize working with large datasets in MongoDB

We have multiple collections of about 10,000 documents each (this number will keep growing) that are generated in node.js and need to be stored, queried, filtered and projected multiple times, for which we have a MongoDB aggregation pipeline. Once certain conditions are met, the documents are regenerated and stored.

Everything worked fine when we had 5,000 documents. We inserted them as an array in a single document and used $unwind in the aggregation pipeline. However, at a certain point the documents no longer fit in a single document because they exceed the 16 MB document size limit. We now need to store everything in bulk and add some identifiers so we know which 'collection' the documents belong to, so we can run the pipeline on those documents only.
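
For context, a minimal sketch of the two pipeline shapes (the match criteria and field names are assumptions based on the insert code further down; the rest of the pipeline is omitted):

// Old shape: one wrapper document holding all generated docs in an
// array, unwound at the start of the pipeline (this hit the 16 MB limit).
col.aggregate([
    { $match  : { group : group.metadata } },
    { $unwind : '$docs' },
    // ... rest of the pipeline
]);

// New shape: one MongoDB document per generated doc, selected by the
// group identifier instead of being unwound from an embedded array.
col.aggregate([
    { $match : { group : group.metadata } },
    // ... rest of the pipeline
]);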

Problem: Writing the documents, which has to happen before we can query them in a pipeline, is problematically slow. The bulk.execute() call alone can easily take 10-15 seconds, whereas collecting the documents in a node.js array and writing the resulting <16 MB document to MongoDB takes only a fraction of a second.

bulk = col.initializeOrderedBulkOp();

for (var i = 0, l = docs.length; i < l; i++) {
    bulk.insert({
        doc   : docs[i],         // the generated document itself
        group : group.metadata   // identifier of the 'collection' it belongs to
    });
}

bulk.execute(bulkOpts, function(err, result) {
    // ...
});
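
For reference, bulkOpts is not shown above, so here is a hypothetical version of the options together with an unordered variant of the same loop (the unordered op is not from our code; it is only meant to illustrate that insertion order does not matter for independent documents):

// Hypothetical write options; the real bulkOpts is defined elsewhere.
var bulkOpts = { w : 1 };

// Unordered variant: the documents are independent, so the server does
// not have to apply the inserts in submission order.
var unorderedBulk = col.initializeUnorderedBulkOp();

for (var i = 0, l = docs.length; i < l; i++) {
    unorderedBulk.insert({
        doc   : docs[i],
        group : group.metadata
    });
}

unorderedBulk.execute(bulkOpts, function(err, result) {
    // result.nInserted: number of documents written
});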

How can we reduce the latency of these bulk writes?


Thoughts so far:

  • A memory-based collection that temporarily handles queries while the data is being written to disk.
  • Figure out whether the In-Memory Storage Engine (alert: considered beta and not for production) is worth a MongoDB Enterprise license.
  • Perhaps the WiredTiger storage engine offers improvements over MMAPv1 beyond compression and encryption.
  • Storing a single (array) document anyway, but splitting it into <16 MB chunks (see the sketch after this list).
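
A rough sketch of that last idea, assuming the documents average around 3 kB (see the comments below) and using an arbitrary chunk size that keeps each wrapper document well under 16 MB:

// Split the generated docs over several wrapper documents, each safely
// below the 16 MB limit, so the existing $unwind pipeline still works.
var chunkSize = 4000; // assumed; tune so chunkSize * average doc size stays < 16 MB

var chunks = [];
for (var i = 0; i < docs.length; i += chunkSize) {
    chunks.push({
        group : group.metadata,
        docs  : docs.slice(i, i + chunkSize)
    });
}

col.insertMany(chunks, function(err, result) {
    // The pipeline can $match on group and $unwind '$docs' as before.
});
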
Dirndl answered 12/4, 2016 at 20:27 Comment(7)
I think in general your problems stem from over-embedding. 5k docs, and even 10k docs, is something I'd handle on my Raspberry Pi without a hassle, and I know that I deal with much larger data sets (multiple orders of magnitude) on my 4 GB laptop. However, data (re)modeling is a task not to be taken lightly. I strongly suggest getting a specialist to help you with that. As for your question: please ask only one question at a time, as I am sure you know. The way it is now, it is far too broad.Hessney
I'd also read the Memory section on the WiredTiger storage engine.Tomblin
I edited my question to cut out a lot of the verbosity. Please consider cancelling the vote to close.Dirndl
Let me try to understand better. The issue is that a single mongo document contains an array of what you consider individual docs in your domain model, so the mongo doc is > 16 MB? If so, can't you remodel the input so that a domain model doc coincides with a mongo doc?Dogoodism
@DanieleDellafiore that was the old situation. And it was fast. Then things became too big so I remodelled the input. Now I do bulk inserts, and this is slow, probably 50+ times slower. That's what the question is about.Dirndl
@redsandro so in the end you have 10k docs, JSON, about 10MB per document. Your store procedure with bulk write (the code in the question) takes 15 secs. Correct?Dogoodism
@DanieleDellafiore no. The combined documents exceed 16 MB. Say 30 MB. So I can no longer combine all documents into one document. Now I need to store them individually in bulk. That's about 3 kilobytes per document.Dirndl
