Comparing MongoDB and RethinkDB Bulk Insert Performance

This is my official first question here; I welcome any/all criticism of my post so that I can learn how to be a better SO citizen.

I am vetting non-relational DBMSs for storing potentially large email opt-out lists, leaning toward either MongoDB or RethinkDB, using their respective Python client libraries. The pain point of my application is bulk insert performance, so I have set up two Python scripts to insert 20,000 records in batches of 5,000 into both a MongoDB and a RethinkDB collection.

The MongoDB Python script, mongo_insert_test.py:

NUM_LINES = 20000
BATCH_SIZE = 5000

def insert_records():
    # `mongo` is an open pymongo Database handle (connection setup not shown)
    collection = mongo.recips
    i = 0
    batch_counter = 0
    batch = []
    while i < NUM_LINES:
        i += 1
        recip = {
            'address': "test%d@test%d.com" % (i, i)
        }
        if batch_counter < BATCH_SIZE:
            batch.append(recip)
            batch_counter += 1
        # flush the batch every BATCH_SIZE records, and once at the end
        if (batch_counter == BATCH_SIZE) or i == NUM_LINES:
            collection.insert(batch)
            batch_counter = 0
            batch = []

if __name__ == '__main__':
    insert_records()

The almost identical RethinkDB Python script, rethink_insert_test.py:

NUM_LINES = 20000
BATCH_SIZE = 5000

def insert_records():
    # `r` is the rethinkdb module with an open repl() connection (setup not shown)
    i = 0
    batch_counter = 0
    batch = []
    while i < NUM_LINES:
        i += 1
        recip = {
            'address': "test%d@test%d.com" % (i, i)
        }
        if batch_counter < BATCH_SIZE:
            batch.append(recip)
            batch_counter += 1
        # flush the batch every BATCH_SIZE records, and once at the end
        if (batch_counter == BATCH_SIZE) or i == NUM_LINES:
            r.table('recip').insert(batch).run()
            batch_counter = 0
            batch = []

if __name__ == '__main__':
    insert_records()

In my dev environment, the MongoDB script inserts 20,000 records in under a second:

$ time python mongo_insert_test.py 
real    0m0.618s
user    0m0.400s
sys     0m0.032s

In the same environment, the RethinkDB script performs much more slowly, inserting 20,000 records in over 2 minutes:

$ time python rethink_insert_test.py
real    2m2.502s
user    0m3.000s
sys     0m0.052s

Am I missing something huge here with regard to how these two DBMSs work? Why is RethinkDB performing so badly in this test?

My dev machine had about 1.2GB available memory for these tests.

Under answered 1/3, 2013 at 6:3 Comment(4)
The title says it all ;) rethinkdb.com/blog/… – Staceystaci
@FabianoPS, I wasn't really aiming for anything scientific here; I was just wondering why it took an order of magnitude more time to do the same operation in Rethink as in MongoDB. I don't believe network/disk latency played much (if any) role. Rethink has published an update which addresses this in a hopefully meaningful (non-superficial) way. – Under
Hi @njyunis, this article is very interesting in its core message: there are a lot of ways to interpret write speed. This is a comment, not an answer. – Staceystaci
@Under Could you try again with the latest RethinkDB version and let us know the difference? – Onega

RethinkDB currently implements batch inserts by doing a single insert at a time on the server. Since Rethink flushes every record to disk (because it's designed with safety first in mind), this has a really bad effect on workloads like this one.

We're doing two things to address this:

  1. Bulk inserts will be implemented via a bulk insert algorithm on the server to avoid doing one insert at a time.
  2. We will give you the option to relax durability constraints to allow the cache memory to absorb high-throughput inserts if you'd like (in exchange for not syncing to disk as often); see the sketch after this list.
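
To make point 2 concrete, here is a minimal sketch of what a relaxed-durability insert could look like from the Python driver; the durability='soft' option and the connection details are assumptions for illustration, not a shipped API at the time of this answer:

import rethinkdb as r

# Assumed connection details; adjust host/port/db to your setup.
conn = r.connect(host='localhost', port=28015, db='test')

batch = [{'address': 'test%d@test%d.com' % (i, i)} for i in range(5000)]

# Hypothetical relaxed-durability insert: the server could acknowledge the
# write once it is in cache instead of waiting for a disk sync per record.
r.table('recip').insert(batch, durability='soft').run(conn)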

This will absolutely be fixed in 4-12 weeks (and if you need this ASAP, feel free to shoot me an email at [email protected] and I'll see if we can reprioritize).

Here are the relevant github issues:

https://github.com/rethinkdb/rethinkdb/issues/207

https://github.com/rethinkdb/rethinkdb/issues/314

Hope this helps. Please don't hesitate to ping us if you need help.

Pontone answered 1/3, 2013 at 6:49 Comment(2)
Thanks @coffeemug, this answers my question. I'll keep an eye on GitHub so I know when this has been addressed, as I'm excited to give Rethink a try. – Under
@njyunis: 1.5 recently came out with faster bulk inserts: rethinkdb.com/blog/1.5-release. – Zendah

Leaving aside what coffeemug posted:

  1. Depending on what driver version you are using and how you configure the connection to MongoDB, those inserts might not even be acknowledged by the server. If you are using the latest version of the Python driver, those operations wait only for a receipt acknowledgement from the server (which doesn't mean the data has even been written to memory). For more details on what I'm referring to, check out the MongoDB write concern setting.

  2. You could get a speedup in RethinkDB's case by parallelizing the inserts (see the sketch after this list). Basically, if you run multiple processes/threads, you'll see the speed go up. In MongoDB's case, due to the locks involved, parallelism will not help.
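
As a rough sketch of point 2, this is one way the RethinkDB inserts from the question could be spread across processes; the connection details, worker count, and helper function are assumptions for illustration:

import rethinkdb as r
from multiprocessing import Pool

NUM_LINES = 20000
BATCH_SIZE = 5000

def insert_batch(start):
    # Each worker opens its own connection; driver connections should not
    # be shared across processes.
    conn = r.connect(host='localhost', port=28015, db='test')
    batch = [{'address': 'test%d@test%d.com' % (i, i)}
             for i in range(start, start + BATCH_SIZE)]
    r.table('recip').insert(batch).run(conn)
    conn.close()

if __name__ == '__main__':
    # Four workers, each inserting one 5,000-record batch concurrently.
    pool = Pool(processes=4)
    pool.map(insert_batch, range(0, NUM_LINES, BATCH_SIZE))
    pool.close()
    pool.join()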

That being said, RethinkDB could improve the speed of writes.

PS: I work for RethinkDB, but the above points are based on my unbiased knowledge of both systems.

Latent answered 1/3, 2013 at 7:55 Comment(2)
Parallelism with pymongo should also bring some improvement, as time spent on the wire would block less. – Woolgrower
I'm going to update my mongo test script to use MongoClient instead to address this ack issue. I'll edit my post when I've had a chance to do this. – Under

Pymongo developer here - just in case you are not doing so, please make sure you are using the latest pymongo version and MongoClient or MongoReplicaSetClient, so your writes are acknowledged and not fire-and-forget. As @Alex says, they will most likely be what you require.
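
For illustration, here is a minimal sketch of an acknowledged bulk insert with MongoClient, using the pymongo 2.x-style insert from the question; the host, database, and collection names are assumptions:

from pymongo import MongoClient

# MongoClient acknowledges writes by default (w=1), unlike the older
# Connection class, whose writes were fire-and-forget unless safe=True.
client = MongoClient('localhost', 27017)
collection = client.test.recips  # assumed database/collection names

batch = [{'address': 'test%d@test%d.com' % (i, i)} for i in range(5000)]

# With an acknowledged write concern, this call blocks until the server
# confirms it has received the documents (not necessarily flushed them).
collection.insert(batch)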

Other considerations I would have are: is this the primary use case for the database or just the core pain point? You may want to consider other data patterns, querying the data, ease of use and maintainability before making your decision.

Woolgrower answered 1/3, 2013 at 10:8 Comment(1)
Yes, I realized after I ran the test that I was using a rather outdated mongo client; thanks for pointing that out. I'm using a Postgres db for general Django relational stuff, and NoSQL for email list storage, with importing (bulk inserts) and scrubbing (some kind of map/reduce to compare lists) being the two main use cases for the NoSQL db. – Under

Please pardon the analogy; however, it makes my point crystal clear.

It doesn't take much time to lock something valuable in a safe, but doing it thousands of times over will. If instead you posted your valuable to a bank's vault, consider the time it isn't secure during its journey to the bank; that parcel would probably be piled up with many other parcels from like-minded depositors. Someone will eventually get round to opening your parcel and then pile it with other things to be placed in a secure vault.

Therein lies the difference between regular commits of data to disk and batching or lazily writing data to disk. It's a trade-off between higher data integrity and improved write performance. If the loss of data doesn't matter so much, then it's perfectly acceptable to sync to disk less frequently, or to batch or lazily write updates. Making the wrong choice will bite you in the bum one day, so choose wisely!

Typo answered 7/11, 2015 at 0:55 Comment(0)
