Node.js - Sending a big object to child_process is slow

My use case is as follows: I make plenty of REST API calls from my Node server to public APIs. Sometimes the response is big and sometimes it's small. My use case requires me to stringify the response JSON. I know that a big JSON response is going to block my event loop. After some research I decided to use child_process.fork for parsing these responses, so that the other API calls need not wait. I tried sending a big 30 MB JSON file from my main process to the forked child_process. It takes very long for the child process to pick up and process the JSON. The response I'm expecting from the child process is not huge; I just want to stringify it, get the length, and send that back to the main process.

I'm attaching the master and child code.

var moment = require('moment');
var fs = require('fs');
var child_process = require('child_process'); // avoid shadowing the global `process`
var request = require('request');

var start_time = moment.utc().valueOf();

request({url: 'http://localhost:9009/bigjson'}, function (err, resp, body) {

  if (!err && resp.statusCode == 200) {

    console.log('Body Length : ' + body.length);

    var ls = child_process.fork("response_handler.js", ["0"]); // args must be an array

    ls.on('message', function (message) {
        console.log(moment.utc().valueOf() - start_time);
        console.log(message);
    });
    ls.on('close', function (code) {
        console.log('child process exited with code ' + code);
    });
    ls.on('error', function (err) {
        console.log('Error : ' + err);
    });
    ls.on('exit', function (code, signal) {
        console.log('Exit : code : ' + code + ' signal : ' + signal);
    });

    ls.send({content: body}); // ls is only defined inside this block
  }
});

response_handler.js

console.log("Process " + process.argv[2] + " at work ");

process.on('message', function (json) {
  console.log('Before Parsing');
  var x = JSON.stringify(json);
  console.log('After Parsing');
  process.send({msg: 'Sending message from the child. Total size is ' + x.length});
});

Is there a better way to achieve what I'm trying to do? On one hand I need the power of Node.js to make thousands of API calls per second, but sometimes I get a big JSON back which screws things up.

Mediative answered 21/11, 2015 at 10:57 Comment(5)
Your approach seems to be fine. When you say "my node server" I understand it to be a process that serves clients. Do you really need to do the API calls from inside your server? Can't you delegate the task to different processes and set up a communication channel between them and your server, like a message broker, Redis, or simply pipes or some other form of IPC?Debonair
My bad for calling this a server; you can consider it to be an agent. It is not serving anyone. This agent acts as a highly scalable API client.Mediative
Maybe you can use a streaming JSON parser rather than doing it in one big block with JSON.Minnesota
@Minnesota how would a streaming parser be a performance improvement here?Debonair
@Debonair It depends on what you need to do in the end, and I haven't looked at possible modules, but it seems you might be able to parse the JSON asynchronously, and if you don't need the whole object at once that might help.Minnesota

Your task seems to be both IO-bound (fetching the 30 MB JSON), where Node's asynchronicity shines, and CPU-bound (parsing the 30 MB JSON), where asynchronicity doesn't help you.

Forking too many processes soon becomes a resource hog and degrades performance. For CPU-bound tasks you need just as many processes as you have cores and no more.

I would use one separate process to do the fetching and delegate parsing to N other processes, where N is (at most) the number of your CPU cores minus 1 and use some form of IPC for the process communication.

One choice is to use Node's Cluster module to orchestrate all of the above: https://nodejs.org/docs/latest/api/cluster.html

Using this module, you can have a master process create your worker processes upfront, so you don't need to worry about when to fork, how many processes to create, etc. IPC works as usual with process.send and process.on. A possible workflow (sketched in code after the list) is:

  1. Application startup: the master process creates a "fetcher" and N "parser" processes.
  2. The fetcher is sent a work list of API endpoints to process and starts fetching JSON, sending it back to the master process.
  3. For every JSON fetched, the master sends it to a parser process. You could use them in a round-robin fashion or use a more sophisticated way of signalling to the master process when a parser's work queue is empty or running low.
  4. The parser processes send the resulting JSON object back to the master.
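
To make this concrete, here is a rough sketch of that layout: a single file started by every process, where the ROLE environment variable, the placeholder endpoint URL, and the shape of the result message are my own illustrative assumptions rather than anything prescribed by Cluster.

var cluster = require('cluster');
var os = require('os');
var request = require('request');

if (cluster.isMaster) {
  // 1. Startup: one fetcher and N parser workers, all forked upfront.
  var numParsers = Math.max(1, os.cpus().length - 1);
  var fetcher = cluster.fork({ROLE: 'fetcher'});
  var parsers = [];
  for (var i = 0; i < numParsers; i++) {
    parsers.push(cluster.fork({ROLE: 'parser'}));
  }

  // 2./3. Every body the fetcher sends up is handed to a parser, round-robin.
  var next = 0;
  fetcher.on('message', function (msg) {
    parsers[next].send(msg);
    next = (next + 1) % parsers.length;
  });

  // 4. Parsers report their results back to the master.
  parsers.forEach(function (parser) {
    parser.on('message', function (result) {
      console.log('Parsed a response of length ' + result.length);
    });
  });

  // Hand the fetcher its work list (the endpoint URL is a placeholder).
  fetcher.send({urls: ['http://localhost:9009/bigjson']});
} else if (process.env.ROLE === 'fetcher') {
  // Fetcher worker: IO-bound, never parses large bodies itself.
  process.on('message', function (work) {
    work.urls.forEach(function (url) {
      request({url: url}, function (err, resp, body) {
        if (!err && resp.statusCode == 200) {
          process.send({content: body}); // raw string; parsing happens elsewhere
        }
      });
    });
  });
} else {
  // Parser worker: the CPU-bound JSON work lives here.
  process.on('message', function (msg) {
    var obj = JSON.parse(msg.content);
    process.send({length: msg.content.length, keys: Object.keys(obj).length});
  });
}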

Note that IPC also has non-trivial overhead, especially when sending/receiving large objects. You could even have the fetcher do the parsing of very small responses instead of passing them around, to avoid this. "Small" here is probably < 32 KB.
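
As a minimal illustration of that cut-off (the handleBody helper and the limit are placeholders, not part of any library):

var SMALL_LIMIT = 32 * 1024; // rough "small response" threshold from above

function handleBody(body, sendToParser) {
  if (body.length < SMALL_LIMIT) {
    return JSON.parse(body);       // cheap enough to do inline in the fetcher
  }
  sendToParser({content: body});   // pay the IPC cost only for large bodies
}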

See also: Is it expensive/efficient to send data between processes in Node?

Debonair answered 21/11, 2015 at 11:59 Comment(6)
This is exactly what I was trying to do. The main thread is going to fetch all the content from various APIs, and the computation is to be handled by separate processes. The problem here is passing the responses to these new processes. I have given a simplified example, where it looks like I'm going to create as many processes as there are responses, but I'm certainly planning to restrict it to the number of cores. However, I'm not seeing how it can be orchestrated by Node's cluster module. Let me know if I'm missing something here. If you have any pointers, please point me to them. Appreciate your response.Mediative
@Mediative I updated the answer. Take a look at the documentation for Cluster; it is quite straightforward.Debonair
I'll try this. It looks like the assumption here is that piping big JSON to the parsers from the master would not take as much time as it does in my example, where I'm using child_process.fork(). Since Node's cluster internally uses child_process.fork(), I wonder how it can behave differently. I'll give it a shot to confirm. Appreciate the response.Mediative
@Mediative The process of forking is no different. The number of fork calls is. If you would otherwise fork for every fetched JSON chunk, you save your application 1000s of unnecessary fork calls by having your 2-3 worker processes started once and living for the lifetime of the application.Debonair
@kilron That was not what I was trying to do. If you have a look at my first response to this reply, I stated that I was planning to restrict the number of child processes to the number of cores and reuse the processes.Mediative
@Mediative I see. The purpose of Cluster is to provide an easier integration of multiple processes. If you have already optimized to the number of CPUs and minimized the number of forks, I don't think you can do much more to speed things up. Passing large objects from process to process also incurs significant overhead (see my updated answer).Debonair
