Insert a large csv file, 200'000 rows+, into MongoDB in NodeJS

I'm trying to parse and insert a big csv file into MongoDB, but when the file exceeds 100'000 rows I get a bad response from the server. And the files I need to insert are usually above 200'000 rows.

I've tried both a bulk insert (insertMany) and a Babyparse (Papaparse) streaming approach that inserts the file row by row, but with poor results.
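For context, the batched insertMany version I mean would be something along these lines (a simplified sketch only, not my exact code; the batch size of 1000 is arbitrary and Entry is the same Mongoose model used below):

var async = require('async');
var baby = require('babyparse');

baby.parseFiles(csv, {
    header: true,
    skipEmptyLines: true,
    complete: function(results) {
        var rows = results.data;
        var batches = [];

        // Split the parsed rows into chunks instead of one giant insert
        for (var i = 0; i < rows.length; i += 1000) {
            batches.push(rows.slice(i, i + 1000));
        }

        // Insert the chunks one after another
        async.eachSeries(batches, function(batch, done) {
            Entry.insertMany(batch, done);
        }, function(err) {
            if (err) { return console.error(err); }
            console.log('All batches inserted');
        });
    }
});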

Node api:

router.post('/csv-upload/:id', multipartMiddleware, function(req, res) {

    // Post variables
    var fileId = req.params.id;
    var csv = req.files.files.path;

    // create a queue object with concurrency 5
    var q = async.queue(function(row, callback) {
        var entry = new Entry(row);
        entry.save();
        callback();
    }, 5);

    baby.parseFiles(csv, {
        header: true, // Includes header in JSON
        skipEmptyLines: true,
        fastMode: true,
        step: function(results, parser) {
            results.data[0].id = fileId;

            q.push(results.data[0], function (err) {
                if (err) { throw err; }
            });
        },
        complete: function(results, file) {
            console.log("Parsing complete:", results, file);
            q.drain = function() {
                console.log('All items have been processed');
                res.send("Completed!");
            };
        }
    });
});

This streaming approach results in: POST SERVER net::ERR_EMPTY_RESPONSE

Not sure if I'm using the async.queue correctly though.

Is there a better, more efficient way to do this, or am I doing something wrong?

Express Server:

// Dependencies
var express = require('express');
var path = require('path');
var bodyParser = require('body-parser');
var routes = require('./server/routes');
var mongoose = require("mongoose");
var babel = require("babel-core/register");
var compression = require('compression');
var PORT = process.env.PORT || 3000;
// Include the cluster module
var cluster = require('cluster');

mongoose.connect(process.env.MONGOLAB_URI || 'mongodb://localhost/routes');

// Code to run if we're in the master process
if (cluster.isMaster) {

    // Count the machine's CPUs
    var cpuCount = require('os').cpus().length;

    // Create a worker for each CPU
    for (var i = 0; i < cpuCount; i += 1) {
        cluster.fork();
    }

// Code to run if we're in a worker process
} else {
    // Express
    var app = express();

    app.use(bodyParser.json({limit: '50mb'}));
    app.use(bodyParser.urlencoded({limit: '50mb', extended: true}));

    // Compress responses
    app.use(compression());

    // Used for production build
    app.use(express.static(path.join(__dirname, 'public')));

    routes(app);

    // Routes
    app.use('/api', require('./server/routes/api'));

    app.all('/*', function(req, res) {
        res.sendFile(path.join(__dirname, 'public/index.html'));
    });

    // Start server
    app.listen(PORT, function() {
        console.log('Server ' + cluster.worker.id + ' running on ' + PORT);
    });
}
Woodwind asked 25/1, 2017 at 17:36 Comment(4)
This solution I did for another question might be of help to you - #32386618 - the OP reported loading 700k records from csv in about 1 minute. - Iq
Thank you @Robbie! I will take a look at that. - Woodwind
That thread helped me a lot @Iq, thank you. - Woodwind
Glad it helped you. - Iq

Handling the import:

Great question. In my experience, by far the fastest way to insert a csv into mongo is via the command line:

mongoimport -d db_name -c collection_name --type csv --file file.csv --headerline 

I don't believe mongoose has a way of calling mongoimport (someone correct me if I'm wrong).

But it's simple enough to call via node directly:

var exec = require('child_process').exec;
var cmd = 'mongoimport -d db_name -c collection_name --type csv --file file.csv --headerline';

exec(cmd, function(error, stdout, stderr) {
  // do whatever you need during the callback
});

The above will have to be modified to be dynamic, but it should be self-explanatory.
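For example, a more dynamic version might look roughly like this (just a sketch; using spawn with an argument array instead of a shell string is my own preference here, and the callback wiring is only illustrative):

var spawn = require('child_process').spawn;

// Build the mongoimport arguments from variables rather than hard-coding them.
// Passing them as an array avoids shell quoting/escaping issues.
function importCsv(dbName, collectionName, filePath, callback) {
    var args = [
        '-d', dbName,
        '-c', collectionName,
        '--type', 'csv',
        '--file', filePath,
        '--headerline'
    ];

    var child = spawn('mongoimport', args);

    var done = false;
    function finish(err) {
        if (done) { return; }
        done = true;
        callback(err);
    }

    child.on('error', finish); // e.g. mongoimport not found on PATH
    child.on('close', function(code) {
        finish(code === 0 ? null : new Error('mongoimport exited with code ' + code));
    });
}

// importCsv('db_name', 'collection_name', '/tmp/file.csv', function(err) { ... });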

Handling the upload:

Uploading the file from a front-end client is another challenge.

Most browsers will time out if you make a request to a server and don't get a response within 60 seconds (which is probably what you are seeing above).

One solution would be to open a socket connection (search for socket.io on npm for details). This creates a persistent connection to the server that isn't subject to the timeout restrictions.
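A very rough sketch of that idea, assuming socket.io is added to the existing express app (the event names and the runImport helper are made up for illustration):

var http = require('http').createServer(app);
var io = require('socket.io')(http);

io.on('connection', function(socket) {
    // The client emits 'start-import' instead of waiting on a long HTTP response
    socket.on('start-import', function(data) {
        // runImport is a placeholder for whatever does the parsing/inserting
        runImport(data.fileId, function(err) {
            socket.emit('import-done', { error: err ? err.message : null });
        });
    });
});

http.listen(PORT);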

If uploading is not an issue and the timeout is due to the slow parsing/inserting, then you may not have to worry about this once you implement the above.

Other considerations:

I'm not sure exactly what you need to send back to the user, or what parsing needs to take place. But that work can either be done outside of the normal request/response cycle, or handled over a socket connection if it's needed within a single request/response cycle.
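A rough sketch of the first option, reusing the importCsv helper sketched above (the 202 response and the hard-coded names are only illustrative):

// Acknowledge the upload immediately, then do the heavy work in the background
router.post('/csv-upload/:id', multipartMiddleware, function(req, res) {
    var fileId = req.params.id;
    var csvPath = req.files.files.path;

    // Respond before the import finishes so the browser never hits its timeout
    res.status(202).send('Import started');

    importCsv('db_name', 'collection_name', csvPath, function(err) {
        if (err) { return console.error('Import failed:', err); }
        console.log('Import finished for file', fileId);
    });
});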

Sylvestersylvia answered 25/1, 2017 at 18:09 Comment(4)
Thank you for your reply. That sounds interesting. Uploading the file is not a problem right now. Do you have any idea how mongoimport will handle the csv? The file will always look something like this: version task concept price / 1 1 3 1 / 1 1 2 2 / 1 1 1 1 / etc... - Woodwind
To see how it handles the import I would just test with a sample csv. Put the csv on your server and run the mongoimport command mentioned above, referencing your sample csv. Then open mongo via the command line and view the data that was inserted. Also, the mongo documentation is a great reference: docs.mongodb.com/manual/reference/program/mongoimport - Sylvestersylvia
Thanks. Another problem with this approach is that I need to add the id dynamically, because it is not included in the file. - Woodwind
How are you using the id? Is each id a different collection, or is the id a field you add within the csv? If it's the former, use the id as the collection name; if it's the latter, you could do a simple mongoose update in the callback of the mongoimport exec command. - Sylvestersylvia
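A sketch of that second option, building on the exec example in the answer (the fileId field name and the $exists filter are assumptions about how the freshly imported rows can be identified):

exec(cmd, function(error, stdout, stderr) {
    if (error) { return console.error(error); }

    // Tag the freshly imported rows with the id that isn't in the csv.
    // Assumes the imported rows are the only documents without a fileId yet.
    Entry.updateMany(
        { fileId: { $exists: false } },
        { $set: { fileId: fileId } },
        function(err) {
            if (err) { return console.error(err); }
            console.log('fileId added to imported rows');
        }
    );
});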
