What is the fastest way to write a lot of documents to Firestore?

I need to write a large number of documents to Firestore.

What is the fastest way to do this in Node.js?

Clerkly answered 17/11, 2019 at 3:36 Comment(0)
99

TL;DR: The fastest way to perform bulk data creation on Firestore is by performing parallel individual write operations.

Writing 1,000 documents to Firestore takes:

  1. ~105.4s when using sequential individual write operations
  2. ~ 2.8s when using (2) batched write operations
  3. ~ 1.5s when using parallel individual write operations

There are three common ways to perform a large number of write operations on Firestore.

  1. Performing each individual write operation in sequence.
  2. Using batched write operations.
  3. Performing individual write operations in parallel.

We'll investigate each in turn below, using an array of randomized document data.
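
For context, here is a minimal sketch of the kind of setup the snippets below assume; the collection name and document fields are placeholders, not the exact ones used in the test:

const admin = require('firebase-admin');

admin.initializeApp();

// Collection that all of the test functions below write to (name is a placeholder).
const collection = admin.firestore().collection('bulk-write-test');

// An array of 1,000 randomized documents (field names are illustrative only).
const datas = Array.from({ length: 1000 }, () => ({
  value: Math.random(),
  createdAt: Date.now(),
}));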


Individual sequential write operations

This is the simplest possible solution:

async function testSequentialIndividualWrites(datas) {
  while (datas.length) {
    await collection.add(datas.shift());
  }
}

We write each document in turn, until we've written every document. And we wait for each write operation to complete before starting on the next one.

Writing 1,000 documents takes about 105 seconds with this approach, so the throughput is roughly 10 document writes per second.


Using batched write operations

This is the most complex solution.

async function testBatchedWrites(datas) {
  let batch = admin.firestore().batch();
  let count = 0;
  while (datas.length) {
    batch.set(collection.doc(Math.random().toString(36).substring(2, 15)), datas.shift());
    if (++count >= 500 || !datas.length) {
      await batch.commit();
      batch = admin.firestore().batch();
      count = 0;
    }
  }
}

You can see that we create a WriteBatch object by calling batch(), fill it until its maximum capacity of 500 documents, and then write it to Firestore. We give each document a generated name that is relatively likely to be unique (good enough for this test).

Writing 1,000 documents takes about 2.8 seconds with this approach, so the throughput is roughly 357 document writes per second.

That's quite a bit faster than the sequential individual writes. In fact, many developers use this approach because they assume it is the fastest, but as the results above show, that isn't true. And the code is by far the most complex, due to the 500-operation size constraint on batches.
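
As a side note on those generated names: the SDK can also generate the ID for you, since calling doc() with no arguments returns a reference with an auto-generated ID. A minimal sketch of the same loop using that instead (not part of the original benchmark):

async function testBatchedWritesAutoId(datas) {
  let batch = admin.firestore().batch();
  let count = 0;
  while (datas.length) {
    // doc() without arguments returns a reference with an auto-generated ID.
    batch.set(collection.doc(), datas.shift());
    // Commit once the batch is full (500 writes) or the data runs out.
    if (++count >= 500 || !datas.length) {
      await batch.commit();
      batch = admin.firestore().batch();
      count = 0;
    }
  }
}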


Parallel individual write operations

The Firestore documentation says this about the performance for adding lots of data:

For bulk data entry, use a server client library with parallelized individual writes. Batched writes perform better than serialized writes but not better than parallel writes.

We can put that to the test with this code:

async function testParallelIndividualWrites(datas) {
  await Promise.all(datas.map((data) => collection.add(data)));
}

This code kicks off the add operations as fast as it can, and then uses Promise.all() to wait until they're all finished. With this approach the operations can run in parallel.

Writing 1,000 documents takes about 1.5 seconds with this approach, so the throughput is roughly 667 document writes per second.

The difference isn't nearly as great as between the first two approaches, but it is still over 1.8 times faster than batched writes.
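
One caveat, not covered by the original benchmark: mapping a very large array straight into Promise.all() keeps every pending write in flight at once, which can exhaust memory (see the comments below about a 60k-document import). A rough sketch of a chunked variant, assuming the same collection reference as above and a hypothetical chunk size:

async function testChunkedParallelWrites(datas, chunkSize = 1000) {
  for (let i = 0; i < datas.length; i += chunkSize) {
    // Writes within a chunk run in parallel; chunks run one after another.
    const chunk = datas.slice(i, i + chunkSize);
    await Promise.all(chunk.map((data) => collection.add(data)));
  }
}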


A few notes:

  • You can find the full code of this test on GitHub.
  • While the test was done with Node.js, you're likely to get similar results across all platforms that the Admin SDK supports.
  • Don't perform bulk inserts using client SDKs though, as the results may be very different and much less predictable.
  • As usual, the actual performance depends on your machine, the bandwidth and latency of your internet connection, and many other factors. Those may also change the size of the gaps between the approaches, although I expect the ordering to remain the same.
  • If you have any outliers in your own tests, or find completely different results, leave a comment below.
  • Batched writes are atomic. So if you have dependencies between the documents and all documents must be written, or none of them must be written, you should use a batched write.
Clerkly answered 17/11, 2019 at 3:36 Comment(21)
This is super interesting, thank you for doing the work! OOC, did you test running the batched writes in parallel? Obviously, in that case you would need to be even more sure to avoid any document being in both batches.Cleek
I was about to test parallel batched writes, but ran out of quota (it's a free project, and I was too lazy to upgrade). Today is another day, so I might give it a try, and update my answer if it's significant.Clerkly
@Cleek I just tested with parallel batched writes too. The performance is very similar to the individual parallel writes, so I'd say they're tied for first in my tests. I do expect that batched writes may deteriorate faster due to the way they're processed on the back-end. Combined with the much more complex code, I'd still recommend only using them for their atomicity and not the perceived-but-non-existent performance advantage.Clerkly
@FrankvanPuffelen parallelized writes will be faster also if I "set" documents instead of "add" documents? I mean, db.collection('cities').doc('LA').set(data) instead of db.collection('cities').add(data)Dogtrot
Calling add() does nothing more than generate a unique ID (purely client-side), followed by a set() operation. So the results should be the same. If that's not what you observe, post a new question with the minimal case that reproduces what you have tried.Clerkly
Very insightful Frank! I was wondering why my writes were so slow. This should be part of the write documentation IMO. Question, is it possible to get the auto generated unique ids for each doc when using any of the fast options? I need those references stored :)Claudette
In my tests I generated unique IDs with Math.random().toString(36).substring(2, 15), which I'm pretty sure I copied from the SDK source code. :)Clerkly
Haha great! Dankjewel @FrankClaudette
@FrankvanPuffelen I've been benchtesting these methods (parallel batch writes vs parallel individual writes) using cloud functions and have found the performance for both terrible, even after accounting for the cold start delay (around 5 seconds each time). Also the batch method is 3x faster than the individual method (using set() not add()) for some reason. Writing 1200 docs takes around 7 seconds using batch and 22 seconds using individual writes. Ran each of the tests multiple times to ensure it wasn't some sort of one-off anomaly. Issue with cloud functions perhaps? Perplexing.Illyrian
In addition to above, I use set() as I am cloning a collection, and it contains the firestore-generated 20 char non-sequential uids. The function is quite simple - not too different to your examples. Perhaps I should raise this as a separate question, with code examples.Illyrian
Sounds like you got some interesting results @DG. Can you write them up in an answer? I'm quite sure folks would love to see what you've done, and what results you got.Clerkly
As requested @FrankvanPuffelen I've written these results up in an answer below. My benchmarks were a little faster overall today, but the differences between parallel batch writes and parallel individual writes is worse than I previously reported. Needless to say I'm sticking to batch writes in cloud functions for now!Illyrian
How to do parallel writes like this in python?Lapith
One big reason developers may prefer the batched writes: it allows interruptions to be handled and progress to be known before all documents are loaded. If I'm loading millions of documents, I want to know exactly how far along the load is (especially since we're talking <1k documents per second in best case), and if something goes wrong after 2M documents, I want to know I can skip to that point safely and resume. Neither of those are possible with the parallel Promise-based approach.Entrails
You can track progress on an individual level exactly the same as you can track it per batch. Firestore reports success/failure in the same way for both cases. So while there are definitely good reasons for using batches for such a process, this would not be one for me. If you're having trouble implementing progress tracking per individual document, I recommend posting a new question with a reproduction of that problem.Clerkly
Heads up that Promise.all() is rejected if any of the elements are rejected; you may use Promise.allSettled() instead.Interpreter
What about using transactions? I wonder if it's better than batch writing.Erin
That seems very unlikely as the overhead for a transaction is even bigger than for a batched write. The recommendation is/remains to use individual, parallel document writes as they give the best options for parallelism and require the least overhead on the server.Clerkly
@FrankvanPuffelen This is interesting. My main concern is the cost as the number of users in my application grows. Won't each individual write in the parallel-writes solution open a new connection? As the name implies they are parallel, and "batch" implies they are sent in bulk, or is the Firebase SDK taking care of that on the same channel locally on the client side?Erin
If that was the case, a batched write would always be faster than parallel writes, and my answer shows that is not the case. It's not even close.Clerkly
Real helpful. I had to batch the promise.all individual writes, since I had 60k items to import and kept running out of memory (even after upping it to 2GB) at about 20k items.Alixaliza
7

As noted in a comment to the OP, I've had the opposite experience when writing documents to Firestore inside a Cloud Function.

TL;DR: Parallel individual writes are over 5x slower than parallel batch writes when writing 1200 documents to Firestore.

The only explanation I can think of for this is some sort of bottleneck or request rate limiting between Cloud Functions and Firestore. It's a bit of a mystery.

Here's the code for both methods I benchmarked:

const functions = require('firebase-functions');
const admin = require('firebase-admin');


admin.initializeApp();
const db = admin.firestore();


// Parallel Batch Writes
exports.cloneAppBatch = functions.https.onCall((data, context) => {

    return new Promise((resolve, reject) => {

        let fromAppKey = data.appKey;
        let toAppKey = db.collection('/app').doc().id;


        // Clone/copy data from one app subcollection to another
        let startTimeMs = Date.now();
        let docs = 0;

        // Write the app document (and ensure cold start doesn't affect timings below)
        db.collection('/app').doc(toAppKey).set({ desc: 'New App' }).then(() => {

            // Log Benchmark
            functions.logger.info(`[BATCH] 'Write App Config Doc' took ${Date.now() - startTimeMs}ms`);


            // Get all documents in app subcollection
            startTimeMs = Date.now();

            return db.collection(`/app/${fromAppKey}/data`).get();

        }).then(appDataQS => {

            // Log Benchmark
            functions.logger.info(`[BATCH] 'Read App Data' took ${Date.now() - startTimeMs}ms`);


            // Batch up documents and write to new app subcollection
            startTimeMs = Date.now();

            let commits = [];
            let bDocCtr = 0;
            let batch = db.batch();

            appDataQS.forEach(docSnap => {

                let doc = docSnap.data();
                let docKey = docSnap.id;
                docs++;

                let docRef = db.collection(`/app/${toAppKey}/data`).doc(docKey);

                batch.set(docRef, doc);
                bDocCtr++

                if (bDocCtr >= 500) {
                    commits.push(batch.commit());
                    batch = db.batch();
                    bDocCtr = 0;
                }

            });

            if (bDocCtr > 0) commits.push(batch.commit());

            Promise.all(commits).then(results => {
                // Log Benchmark
                functions.logger.info(`[BATCH] 'Write App Data - ${docs} docs / ${commits.length} batches' took ${Date.now() - startTimeMs}ms`);
                resolve(results);
            }).catch(reject);
         
        }).catch(err => {
            reject(err);
        });

    });

});


// Parallel Individual Writes
exports.cloneAppNoBatch = functions.https.onCall((data, context) => {

    return new Promise((resolve, reject) => {

        let fromAppKey = data.appKey;
        let toAppKey = db.collection('/app').doc().id;


        // Clone/copy data from one app subcollection to another
        let startTimeMs = Date.now();
        let docs = 0;

        // Write the app document (and ensure cold start doesn't affect timings below)
        db.collection('/app').doc(toAppKey).set({ desc: 'New App' }).then(() => {

            // Log Benchmark
            functions.logger.info(`[INDIVIDUAL] 'Write App Config Doc' took ${Date.now() - startTimeMs}ms`);


            // Get all documents in app subcollection
            startTimeMs = Date.now();

            return db.collection(`/app/${fromAppKey}/data`).get();

        }).then(appDataQS => {

            // Log Benchmark
            functions.logger.info(`[INDIVIDUAL] 'Read App Data' took ${Date.now() - startTimeMs}ms`);


            // Gather up documents and write to new app subcollection
            startTimeMs = Date.now();

            let commits = [];

            appDataQS.forEach(docSnap => {

                let doc = docSnap.data();
                let docKey = docSnap.id;
                docs++;
                    
                // Parallel individual writes
                commits.push(db.collection(`/app/${toAppKey}/data`).doc(docKey).set(doc));
        
            });

            Promise.all(commits).then(results => {
                // Log Benchmark
                functions.logger.info(`[INDIVIDUAL] 'Write App Data - ${docs} docs' took ${Date.now() - startTimeMs}ms`);
                resolve(results);
            }).catch(reject);
         
        }).catch(err => {
            reject(err);
        });

    });

});

The specific results were (average of 3 runs each):

Batch Writes:

Read 1200 docs - 2.4 secs / Write 1200 docs - 1.8 secs

Individual Writes:

Read 1200 docs - 2.4 secs / Write 1200 docs - 10.5 secs

Note: These results are a lot better than what I was getting the other day - maybe Google was having a bad day - but the relative performance between batch and individual writes remains the same. Would be good to see if anyone else has had a similar experience.

Illyrian answered 7/9, 2020 at 2:27 Comment(3)
With Parallel Batch Writes how can you keep things atomic between batches? When one of the batch writes does fail how can you know what data didn't get updated (in this case) in Firestore?Khalsa
@Khalsa My example was only concerned with raw speed as I was copying an entire set of data to a new collection, and would remove any written data upon a failure of one of the batches (essentially I would clear the target collection).Illyrian
@Khalsa The reality is that, in Firestore, if you need to roll back sets of transactions with greater than 500 writes, you'll have to track and remove any successful batches upon any other batch failure. I can't think of any good solution if your batch is overwriting existing data, other than perhaps backing up that data first - say, in a temporary collection - and then reverting to it afterward if one of the batches failed. The solution really depends on the nature of what you are trying to do.Illyrian
1

I was benchmarking Firestore (with Node) while inserting data. Here are my findings:

In detail:

  • The fastest way to insert data is by using Firestore batches, with around 100 documents per batch. However, we should be careful about:
  • The best way to insert a lot of data is by using the bulk writer (see the sketch below). With this, I successfully inserted 200,000 docs in 208 seconds. From what I understand, the bulk writer autonomously regulates the insertion rate.
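
For reference, a minimal sketch of the BulkWriter approach mentioned above, assuming the firebase-admin SDK; the collection name, document fields, and error handling are illustrative only, not taken from the benchmark:

const admin = require('firebase-admin');

admin.initializeApp();
const db = admin.firestore();

async function bulkWriterInsert(datas) {
  const bulkWriter = db.bulkWriter();

  // Log writes that still fail after BulkWriter's automatic retries.
  bulkWriter.onWriteError((error) => {
    console.error('Write failed:', error.documentRef.path, error.message);
    return false; // don't retry any further
  });

  datas.forEach((data) => {
    // Queue each write; BulkWriter batches and throttles the requests itself.
    bulkWriter.set(db.collection('bulk-insert-test').doc(), data);
  });

  // Resolves once every queued write has completed.
  await bulkWriter.close();
}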
Tacklind answered 27/3 at 0:29 Comment(0)
0

I came across this little library that implements the parallelized batch operations @DG mentioned: https://github.com/stpch/firestore-multibatch. It provides a simple interface, so you can keep adding to the batch without worrying about the 500 op limit.

Stay answered 25/9, 2021 at 1:53 Comment(2)
This seems like an incorrect solution. It's just putting a wrapper around committing multiple batches in parallel for you. E.g. if you have 1000 documents, it will execute two 500 doc batches for you. However, this should only work if the document references are in different collections. There is still a 500 document writes per second limit on an individual collection (where document contains a sequential field) ("Soft Limits" firebase.google.com/docs/firestore/quotas), so the example in the GitHub README is against Firestore's soft limit.Marissamarist
Thanks for pointing this out! Good to know. For us, we're migrating off Firestore because we've been too disappointed by limits and poor DX ergonomics. Gotchas like this are a pretty good example.Stay
0

Use Firestore's batch functionality to write multiple documents in a single request:

Initialize Firestore:

const { Firestore } = require('@google-cloud/firestore');
const firestore = new Firestore();

Create a batch and add write operations:

const batch = firestore.batch();

const data = [...]; // Your array of documents
data.forEach((doc, index) => {
  const docRef = firestore.collection('your-collection').doc(`doc-${index}`);
  batch.set(docRef, doc);
});

Commit the batch (keep in mind that a single batch is limited to 500 operations, so larger arrays must be split across multiple batches, as in the accepted answer):

await batch.commit();
Else answered 19/7 at 7:43 Comment(0)
