How do I optimize working with large datasets in MongoDB

We have multiple collections of about 10,000 documents each (this number will keep growing) that are generated in node.js and need to be stored, queried, filtered and projected multiple times, for which we have a MongoDB aggregation pipeline. Once certain conditions are met, the documents are regenerated and stored.

Everything worked fine when we had 5,000 documents. We inserted them as an array in a single document and used $unwind in the aggregation pipeline. However, at a certain point the documents no longer fit in a single document because they exceed the 16 MB document size limit. We now need to store everything in bulk and add some identifiers so we know which 'collection' the documents belong to, so we can run the pipeline on those documents only.
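
For context, a minimal sketch of the two pipeline shapes (the match criteria and field names are assumptions based on the insert code further down; the rest of the pipeline is omitted):

// Old shape: one wrapper document holding all generated docs in an
// array, unwound at the start of the pipeline (this hit the 16 MB limit).
col.aggregate([
    { $match  : { group : group.metadata } },
    { $unwind : '$docs' },
    // ... rest of the pipeline
]);

// New shape: one MongoDB document per generated doc, selected by the
// group identifier instead of being unwound from an embedded array.
col.aggregate([
    { $match : { group : group.metadata } },
    // ... rest of the pipeline
]);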

Problem: Writing the documents, which has to happen before we can query them in a pipeline, is problematically slow. The bulk.execute() call alone can easily take 10-15 seconds, whereas collecting the documents in a node.js array and writing the resulting <16 MB document to MongoDB takes only a fraction of a second.

bulk = col.initializeOrderedBulkOp();

for (var i = 0, l = docs.length; i < l; i++) {
    bulk.insert({
        doc   : docs[i],         // the generated document itself
        group : group.metadata   // identifier of the 'collection' it belongs to
    });
}

bulk.execute(bulkOpts, function(err, result) {
    // ...
});
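
For reference, bulkOpts is not shown above, so here is a hypothetical version of the options together with an unordered variant of the same loop (the unordered op is not from our code; it is only meant to illustrate that insertion order does not matter for independent documents):

// Hypothetical write options; the real bulkOpts is defined elsewhere.
var bulkOpts = { w : 1 };

// Unordered variant: the documents are independent, so the server does
// not have to apply the inserts in submission order.
var unorderedBulk = col.initializeUnorderedBulkOp();

for (var i = 0, l = docs.length; i < l; i++) {
    unorderedBulk.insert({
        doc   : docs[i],
        group : group.metadata
    });
}

unorderedBulk.execute(bulkOpts, function(err, result) {
    // result.nInserted: number of documents written
});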

How can we reduce the latency of these bulk writes?


Thoughts so far:

  • A memory-based collection that temporarily handles queries while the data is being written to disk.
  • Figure out whether the In-Memory Storage Engine (alert: considered beta and not for production) is worth a MongoDB Enterprise license.
  • Perhaps the WiredTiger storage engine offers improvements over MMAPv1 beyond compression and encryption.
  • Storing a single (array) document anyway, but splitting it into <16 MB chunks (see the sketch after this list).
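
A rough sketch of that last idea, assuming the documents average around 3 kB (see the comments below) and using an arbitrary chunk size that keeps each wrapper document well under 16 MB:

// Split the generated docs over several wrapper documents, each safely
// below the 16 MB limit, so the existing $unwind pipeline still works.
var chunkSize = 4000; // assumed; tune so chunkSize * average doc size stays < 16 MB

var chunks = [];
for (var i = 0; i < docs.length; i += chunkSize) {
    chunks.push({
        group : group.metadata,
        docs  : docs.slice(i, i + chunkSize)
    });
}

col.insertMany(chunks, function(err, result) {
    // The pipeline can $match on group and $unwind '$docs' as before.
});
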
Dirndl answered 12/4, 2016 at 20:27 Comment(7)
I think in general your problems stem from over-embedding. 5k docs, and even 10k docs, is something I'd handle on my Raspberry Pi without a hassle, and I know that I deal with much larger data sets (multiple orders of magnitude) on my 4 GB laptop. However, data (re)modeling is a task not to be taken lightly. I strongly suggest getting a specialist to help you with that. As for your question: please ask only one question at a time, as I am sure you know. The way it is now, it is far too broad.Hessney
I'd also read the Memory section on the WiredTiger storage engine.Tomblin
I edited my question to cut out a lot of the verbosity. Please consider cancelling the vote to close.Dirndl
Let me try to understand better. The issue is that a single mongo document contains an array of what you consider individual docs in your domain model, so the mongo doc is > 16 MB? If so, can't you remodel the input so that a domain model doc coincides with a mongo doc?Dogoodism
@DanieleDellafiore that was the old situation. And it was fast. Then things became too big so I remodelled the input. Now I do bulk inserts, and this is slow, probably 50+ times slower. That's what the question is about.Dirndl
@redsandro so in the end you have 10k docs, JSON, about 10MB per document. Your store procedure with bulk write (the code in the question) takes 15 secs. Correct?Dogoodism
@DanieleDellafiore no. The combined documents exceed 16 MB. Say 30 MB. So I can no longer combine all documents into one document. Now I need to store them individually in bulk. That's about 3 kilobytes per document.Dirndl
