Node.js Large File Uploads to MongoDB blocking the Event Loop and Worker Pool

So I want to upload large CSV files to a MongoDB cloud database from a Node.js server using Express, Mongoose and Multer's GridFS storage engine, but when the file upload starts, my database becomes unable to handle any other API requests. For example, if a different client requests a user from the database while the file is being uploaded, the server will receive the request and try to fetch the user from the MongoDB cloud, but the request will get stuck because the large file upload eats up all the computational resources. As a result, the GET request performed by the client will not return the user until the in-progress file upload is completed.

I understand that if a thread is taking a long time to execute a callback (Event Loop) or a task (Worker), then it is considered "blocked", and that Node.js runs JavaScript code in the Event Loop while it offers a Worker Pool to handle expensive tasks like file I/O. I've read in this blog post on nodejs.org that in order to keep a Node.js server fast, the work associated with each client at any given time must be "small", and that my goal should be to minimize the variation in Task times. The reasoning behind this is that if a Worker's current Task is much more expensive than the other Tasks, it will be unavailable to work on other pending Tasks, effectively decreasing the size of the Worker Pool by one until the Task is completed.

In other words, the client performing the large file upload is executing an expensive Task that decreases the throughput of the Worker Pool, which in turn decreases the throughput of the server. According to the aforementioned blog post, when each sub-Task completes it should submit the next sub-Task, and when the final sub-Task is done it should notify the submitter. That way, between each sub-Task of the long Task (the large file upload), the Worker can pick up a sub-Task from a shorter Task, which solves the blocking problem.
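Abstractly, I can see what that pattern looks like for a plain CPU-bound job. The rough sketch below (not my real code; handleRow is just a placeholder for the per-row work) splits processing of an array into chunks and yields back to the Event Loop with setImmediate between chunks:

    // Rough sketch of the partitioned approach from the blog post.
    // Processes a large array in small chunks, yielding to the Event Loop
    // between chunks so other pending callbacks can run in the meantime.
    function processRowsPartitioned(rows, onDone) {
        // Number of rows handled per turn of the Event Loop.
        const CHUNK_SIZE = 1000;
        let index = 0;

        function processChunk() {
            const end = Math.min(index + CHUNK_SIZE, rows.length);
            for (; index < end; index++) {
                // handleRow is a placeholder for whatever per-row work is needed.
                handleRow(rows[index]);
            }
            if (index < rows.length) {
                // Submit the next sub-Task; other pending callbacks run in between.
                setImmediate(processChunk);
            } else {
                // Final sub-Task done: notify the submitter.
                onDone();
            }
        }

        processChunk();
    }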

However, I do not know how to apply this pattern to my actual file upload code. Are there any specific partitioned functions that can solve this problem? Do I have to use a specific upload architecture or a Node package other than multer-gridfs-storage to upload my files? Please help.

Here is my current file upload implementation using Multer's GridFS storage engine:

   // Adjust how files get stored.
   const storage = new GridFsStorage({
       // The DB connection
       db: globalConnection, 
       // The file's storage configurations.
       file: (req, file) => {
           ...
           // Return the file's data to the file property.
           return fileData;
       }
   });

   // Configure a strategy for uploading files.
   const datasetUpload = multer({ 
       // Set the storage strategy.
       storage: storage,

       // Set the size limits for uploading a file to 300MB.
       limits: { fileSize: 1024 * 1024 * 300 },
    
       // Set the file filter.
       fileFilter: fileFilter,
   });


   // Upload a dataset file.
   router.post('/add/dataset', (req, res) => {
       // Begin the file upload.
       datasetUpload.single('file')(req, res, function (err) {
           // Abort if multer reported an error (file too large, storage failure, etc.).
           if (err) {
               return res.status(500).send(err);
           }
           // Get the parsed file from multer.
           const file = req.file;
           // Upload success.
           return res.status(200).send(file);
       });
   });

Fessler answered 10/5, 2022 at 12:41

So after a couple of days of research, I found out that the root of the problem wasn't Node.js or my file upload implementation. The problem was that MongoDB Atlas couldn't handle the file upload workload at the same time as other operations, such as fetching users from my database. As I stated in the question, Node.js was receiving API calls from other clients as it should, but they weren't returning any results. I now realize that was because they were getting stuck at the DB level. Once I switched to a local deployment of MongoDB, the problem was resolved.

According to this blog post about MongoDB best practices, the total number of active threads (i.e., concurrent operations) relative to the number of CPUs can impact performance and therefore the throughput of the Node.js server. However, I tried dedicated MongoDB clusters with up to 8 vCPUs (the M50 cluster tier), and MongoDB Atlas still could not handle the file upload while serving other client requests.

If someone has made it work with a cloud solution, I'd like to know more. Thank you.

Fessler answered 12/5, 2022 at 15:12

I think this problem stems from buffering: the whole file has to be received into a buffer before it is handed to the consumer, so buffering takes a long time. Streams solve this problem because they let us process the data as soon as it arrives from the source, which is not possible when the data is buffered and processed all at once. I found the storage.fromStream() method on the multer-gridfs-storage GitHub page and tested it by uploading a 122 MB file; it worked for me. Thanks to Node.js streams, every chunk of data is consumed and saved to the cloud database as soon as it is received. The total upload time was under a minute, and the server could easily respond to other requests during the upload.

const { GridFsStorage } = require('multer-gridfs-storage');
const multer = require('multer');
const express = require('express');
const fs = require('fs');
const connectDb = require('./connect');

const app = express();

// Multer first writes the incoming file to a local temp directory.
const upload = multer({ dest: 'uploads/' });

// GridFS storage engine backed by the MongoDB connection.
const storage = new GridFsStorage({ db: connectDb() });

app.post('/profile', upload.single('file'), function (req, res) {
  const { file } = req;
  // Create a read stream from the temp file on disk.
  const stream = fs.createReadStream(file.path);
  // Stream the data into GridFS; each chunk is saved to the cloud DB as it arrives.
  storage.fromStream(stream, req, file)
    .then(() => res.send('File uploaded'))
    .catch(() => res.status(500).send('error'));
});

// A simple route to verify the server stays responsive during an upload.
app.get('/profile', (req, res) => {
  res.send('hello');
});

app.listen(5000);

Encode answered 10/5, 2022 at 14:52
While your approach seems to work at the start, after a while during the upload my server crashes with this error: ...\node_modules\mongoose\node_modules\mongodb\lib\utils.js:106 throw err; ^ TypeError: Cannot read property 'destroyed' of undefined at GridFSBucketWriteStream.Writable.destroy (internal/streams/writable.js:773:14). Do you have any idea what might be causing this? – Fessler
@NikitasIO I need to see your code again to give some idea about it. – Encode
I basically copy-pasted the code you suggested into my own project. I noticed that after the upload crashes the server and prints the "TypeError: Cannot read property 'destroyed' of undefined" error, the blocking problem returns for a while (the server becomes unresponsive to client requests during uploads). After some time passes, the server is able to receive client requests during the upload process again. However, during every large file upload (116 MB), the server crashes again with the same TypeError, bringing the blocking problem back. – Fessler
@NikitasIO I don't think the blocking problem is related to the issue you mentioned, because I checked with a 122 MB file. – Encode
@NikitasIO Just use the above code to upload your file and check it; you will see that this solution works for your issue. Then compare it to your project, and you will find where the error you mentioned comes from. – Encode
Apparently the problem wasn't in the code but rather in the available computational resources provided by my MongoDB Atlas cluster. I've posted a detailed explanation as the accepted answer if you'd like to take a look. Thank you for taking the time to help me; greatly appreciated. – Fessler

I was having a similar issue, and what I did to solve it (at least in part) was to use multiple MongoDB connections.

The upload operation is handled by a dedicated MongoDB connection, and during the upload you can still query the database over another connection. See https://thecodebarbarian.com/slow-trains-in-mongodb-and-nodejs
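Roughly, the idea looks like the sketch below, assuming Mongoose and the multer-gridfs-storage setup from the question (the connection URI and the User model are placeholders):

const mongoose = require('mongoose');
const { GridFsStorage } = require('multer-gridfs-storage');

// Placeholder connection string.
const uri = process.env.MONGO_URI;

// One connection dedicated to GridFS uploads...
const uploadConnection = mongoose.createConnection(uri);
// ...and a separate connection for regular queries.
const queryConnection = mongoose.createConnection(uri);

// Models registered on the query connection keep responding
// while the upload connection is busy streaming file chunks.
const User = queryConnection.model('User', new mongoose.Schema({ name: String }));

// The storage engine uses only the upload connection.
const storage = new GridFsStorage({ db: uploadConnection });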

Impurity answered 13/4, 2023 at 0:42

Can you manage the architecture/infrastructure? If so, this challenge would be best solved by a different approach. It is actually a perfect candidate for a serverless solution, e.g. Lambda.

Lambda does not run requests in parallel on a single machine. It assigns one request to one machine, and until that request finishes, the machine receives no other traffic. Therefore you will never hit the limits you are encountering now.

Moth answered 10/5, 2022 at 12:51
Thank you for the suggestion, but I have to implement this without relying on AWS and solutions like Lambda. I am specifically looking for an answer that involves code for partitioning the upload process. – Fessler
