How to Speed Up Mongodump, Dump Not Finishing

I'm trying to run a database dump with a query against a collection of about 5 billion documents, and the progress output suggests the dump won't finish in any reasonable time (100+ days). The dump also appeared to freeze after roughly 22 hours, still showing 0% - the line after the object count is a metadata.json line.

The dump line is:

mongodump -h myHost -d myDatabase -c mycollection --query "{'cr' : {\$gte: new Date(1388534400000)}, \$or: [ { 'tln': { \$lte: 0., \$gte: -100.}, 'tlt': { \$lte: 100, \$gte: 0} }, { 'pln': { \$lte: 0., \$gte: -100.}, 'plt': { \$lte: 100, \$gte: 0} } ] }"

The last few lines of output were (typed out, as I can't post images yet):

[timestamp] Collection File Writing Progress: 10214400/5066505869 0% (objects)
[timestamp] Collection File Writing Progress: 10225100/5066505869 0% (objects)
[timestamp] 10228391 objects
[timestamp] Metadata for database.collection to dump/database/collection.metadata.json

Any thoughts on how to improve performance, or any idea why this is taking so long?

Gallegos answered 19/1, 2015 at 3:41 Comment(3)
Write this query in the mongo shell and use explain() to see what the query plan is - it might be that the query is slow by itself. – Mammon
In addition to the output of explain(), can you also confirm what version of MongoDB you are using? How many output results are you expecting from your 5 billion source documents? It's unclear if the ~10 million objects might actually be your full result set, as the last line referencing metadata.json is normally emitted when the dump completes for a given collection. – Relax
Hi all, so even the .explain() is taking a very long time (a couple of hours and counting) to run. Is this usual? I even tried simplifying the .explain() to just a location filter (whether the x and y coordinates are both within a range) and it still takes a while. Will continue to run and update. – Gallegos
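
For reference, checking the plan in the mongo shell could look roughly like the sketch below. The host, database, collection, field names, and date are taken from the question's command; on MongoDB 3.0+ you can pass "executionStats" for more detail, while on 2.6 a plain .explain() works.

use myDatabase
db.mycollection.find({
    cr: { $gte: new Date(1388534400000) },
    $or: [
        { tln: { $lte: 0, $gte: -100 }, tlt: { $lte: 100, $gte: 0 } },
        { pln: { $lte: 0, $gte: -100 }, plt: { $lte: 100, $gte: 0 } }
    ]
}).explain("executionStats")

A COLLSCAN stage in the winning plan (or "cursor" : "BasicCursor" in the older explain format) would confirm that no index covers this query.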

I've just faced this issue, and the problem is that mongodump is basically not very smart. It's traversing the _id index, which likely means lots and lots and lots of random disk access. For me, dumping several collections, mongodump was simply crashing due to cursor timeouts.

The issue is also described here: https://jira.mongodb.org/browse/TOOLS-845. However, that doesn't really provide a great resolution apart from "Works as Designed". It's possible there's something funny about the index, but I think in my case it was just a large enough collection that the amount of disk access was seriously hard work for my poor little Mac Mini.

One solution? Shut down writes and then use --forceTableScan, which makes a sequential pass through the data; that might well be faster than using the _id index if you are using a custom _id field (I was).
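
Applied to the command from the question, that looks something like the line below (host, database, and collection are the question's placeholders). Note that depending on your mongodump version, --forceTableScan may not be allowed together with --query, so check mongodump --help; and without stopping writes first, the dump may not be point-in-time consistent.

mongodump -h myHost -d myDatabase -c mycollection --forceTableScan -o dump/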

The docs are a bit sketchy, but it reads as if the normal mongodump behaviour might be to traverse the _id index using a snapshot and then filter by the query. In other words, it may be traversing all 5 billion records in _id order rather than in on-disk order - i.e. essentially randomly - just to apply the query. So you might be better off building a tool that reads from a real index and writes the documents out directly.
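
As a rough sketch of that idea (the script name is made up, and it assumes an index such as { cr: 1 } exists, which the question doesn't confirm), you could run a hinted query in the mongo shell and redirect the output to a file:

// export.js - run with: mongo --quiet myHost/myDatabase export.js > mycollection.json
// Walks an assumed { cr: 1 } index instead of _id, printing one JSON document per line.
var cursor = db.mycollection.find({
    cr: { $gte: new Date(1388534400000) },
    $or: [
        { tln: { $lte: 0, $gte: -100 }, tlt: { $lte: 100, $gte: 0 } },
        { pln: { $lte: 0, $gte: -100 }, plt: { $lte: 100, $gte: 0 } }
    ]
}).hint({ cr: 1 });

while (cursor.hasNext()) {
    print(tojson(cursor.next(), "", true));  // nolint = true keeps each document on one line
}

Depending on your server version you may also need to guard against cursor timeouts (for example by batching over ranges of cr), but the point is that the read order follows an index you chose rather than _id.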

For me, --forceTableScan was enough, and it meant (a) the dump actually completed successfully, and (b) it was an order of magnitude or more faster.

Leeway answered 26/6, 2017 at 20:42 Comment(0)
