Let me preface the question by saying that I understand there will probably not be a definitive, yes-or-no answer on this topic and that answers might be opinion driven. However, I would be thankful for any advice and/or guidance on deploying and operating the following API design.
What I am working on
For a SaaS of mine, I would like to provide its functionality to my customers via an API.
The task the SaaS performs is a long-running, computationally expensive one. So, unfortunately, a straightforward, "synchronous" API where the caller waits for the result of this task to be delivered as the response to their request is not suitable.
Instead, I have settled on an approach where the caller schedules Jobs and periodically queries the API to see if a given job has finished yet. I'm quite happy with that design.
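For illustration, the intended client flow might look like the following sketch. It assumes Node 18+ for the global fetch API; the base URL and the 5-second polling interval are placeholders, and the status endpoint it polls is the hypothetical one sketched further below:

```ts
const base = "https://api.example.com" // placeholder base URL

// Schedule a job and remember its id from the response.
const created = await fetch(`${base}/job`, { method: "POST" })
const { job: jobId } = await created.json()

// Poll the (hypothetical) status endpoint until the job is done.
for (;;) {
    const res = await fetch(`${base}/job/${jobId}`)
    const status = await res.json()
    if (status.state === "completed") break
    await new Promise((resolve) => setTimeout(resolve, 5000))
}
```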
What I have done so far
For the implementation, I have built on / used the following technologies:
- Node.js server
- Express npm package for API and routing
- Redis database for job scheduling
- bullmq npm package for Redis connection and queue management
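For reference, both Express and bullmq come from npm, and bullmq additionally needs a reachable Redis instance (shown here via Docker as one option; adapt as needed):

```bash
npm install express bullmq
# bullmq connects to a Redis instance; a local one can be started e.g. with Docker
docker run -d -p 6379:6379 redis
```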
Using bullmq, I create a Queue and add Jobs to that queue in response to a call to a given endpoint.
For me, this is in api.ts (shortened):
```ts
import Express from "express"
import { Queue } from "bullmq"

const api = Express()
const queue = new Queue("com.mysaas.workerQueue")

api.post("/job", async (request, response) => {
    var data, jobId
    ...
    await queue.add("com.mysaas.defaultJob", data, {
        jobId: jobId,
        removeOnComplete: true,
        ...
    })
    response.send({
        status: "success",
        job: jobId
    })
})
...
```
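The polling side is not shown above; a minimal sketch of such a status endpoint, using bullmq's Queue.getJob() and Job.getState(), could look roughly like this (the route shape and response fields are illustrative, not part of my actual code):

```ts
// Hypothetical status endpoint for the polling part (not in my original api.ts).
// Note: with removeOnComplete: true (as above), a finished job is deleted from
// Redis, so a missing job can mean either "unknown id" or "completed and cleaned up".
api.get("/job/:id", async (request, response) => {
    const job = await queue.getJob(request.params.id)
    if (!job) {
        response.status(404).send({ status: "unknown" })
        return
    }
    const state = await job.getState() // e.g. "waiting", "active", "completed", "failed"
    response.send({
        status: "success",
        job: job.id,
        state: state,
        result: job.returnvalue ?? null
    })
})
```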
Additionally, I create workers which process scheduled jobs, coordinated by the bullmq package, using Redis under the hood.

For me, this is in worker.ts (shortened):
```ts
import { Worker } from "bullmq"

const worker = new Worker("com.mysaas.workerQueue", async (job) => {
    await someWork()
    ...
}, { concurrency: 25 })
```
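One detail that matters for the polling design: whatever the processor function returns is persisted by bullmq as the job's return value (job.returnvalue), which a status endpoint like the sketch above can hand back to the caller. A variant of the processor, with a hypothetical computeResult() standing in for someWork():

```ts
import { Worker } from "bullmq"

// Returning a value from the processor stores it on the completed job.
const resultWorker = new Worker("com.mysaas.workerQueue", async (job) => {
    const result = await computeResult(job.data) // hypothetical stand-in for someWork()
    return result // available later as job.returnvalue
}, { concurrency: 25 })
```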
How I run it currently in my dev environment
I use tsc to compile TypeScript to JavaScript.
I use pm2 as a process manager and daemon. To run the API, I use the following command:

```bash
pm2 start -i max build/api/api.js
```

This will start one Node process for each available CPU core (8 on my dev machine).
To run a worker, I open another terminal and execute:

```bash
node build/worker/worker.js
```

I can add workers by opening more terminals and repeating the command above.
Doing so will cause multiple workers to take Jobs from the Queue, sharing the workload and finishing multiple jobs concurrently and thus more quickly.
All of this works quite nicely in my dev environment.
What I am unsure about
What I don't know is if the current approach is suitable for the production environment. I have studied the documentation for pm2
and bullmq
but I can't seem to find a definitive description of how to combine those two. My target being, of course, maximising performance and API throughput.
The following points are still open issues in my head:
- Will pm2 actually run one process on each available core, i.e. assign a certain core to a given process? Or will it rather spawn as many processes as there are cores and leave it to the lower-level system mechanics to load-balance which core's computation time is used for which process?
- If the former, will child processes I launch from an instance of my API process be assigned to the core the parent process is running on?
- When launching a worker as mentioned above, we can provide a concurrency option, which tells bullmq how many jobs that worker may process concurrently. Is that concurrency also limited to the core assigned to the initial worker process (started with node build/worker/worker.js)?
Given my uncertainty about the above points, I have distilled three approaches to running the processes in the production environment:

1. Start API processes and daemonise with pm2 start -i max build/api/api.js. Start worker processes and daemonise with pm2 start -i max build/worker/worker.js. This would leave me with 1 API process and 1 worker process per available core. I would assume this approach optimises CPU load. However, I have no idea what value to assign to the concurrency parameter of the bullmq worker then. (A pm2 ecosystem file equivalent to these commands is sketched after this list.)
2. Move/import the worker code into the API process. Start API processes and daemonise with pm2 start -i max build/api/api.js. This would leave me with 1 process per core, containing both API and worker. The downside here seems to be that worker and API share a process, meaning that if the API part causes the process to crash for some reason, the worker would shut down with it. Is that correct?
3. Start API processes and daemonise with pm2 start -i max build/api/api.js. Start and daemonise a single worker process with a high value (100+) for the concurrency parameter. Assuming the concurrent work gets load-balanced to the available cores by the system, this should also result in optimised CPU utilisation. The downside being that if the single worker process crashes, no work is done until it is restored.
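As a concrete sketch of approach 1, the two pm2 commands can equivalently be expressed in pm2's ecosystem file format (the file itself is illustrative; what value concurrency should get per worker remains the open question):

```js
// ecosystem.config.js — illustrative pm2 config equivalent to approach 1,
// started with: pm2 start ecosystem.config.js
module.exports = {
    apps: [
        {
            name: "api",
            script: "build/api/api.js",
            instances: "max",    // one instance per available core
            exec_mode: "cluster" // -i max implies cluster mode
        },
        {
            name: "worker",
            script: "build/worker/worker.js",
            instances: "max",
            exec_mode: "cluster"
        }
    ]
}
```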
My gut feeling tells me approach 1 is best in my scenario. However, this is the first time I am deploying a Node.js application of this sort to production. Therefore, I sincerely appreciate any help, advice, guidance or feedback on the described project.
Update 03/2022
We ended up going for a much more bootstrapped approach than deploying everything by hand with pm2, etc.

The overall API concept stayed the same (schedule job, poll job status), but we switched to nestjs as the main driver and central API framework. The nest app is split into api and worker modules. Both get deployed to Heroku as separately scalable processes (Heroku dynos). We ended up using the Heroku Redis add-on for job scheduling with bullmq.
To be honest, after making the switch to Heroku, the only thing we did with regard to the worker concurrency and CPU core questions posed above was to utilise the Heroku-provided WEB_CONCURRENCY environment variable to spawn the appropriate number of worker processes on each dyno.
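For completeness, the split into separately scalable dynos is declared in the app's Procfile; the build output paths here are illustrative:

```
web: node dist/api/main.js
worker: node dist/worker/main.js
```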
The code really ended up as simple as this (using the throng package in our entry-point .js file):
```ts
import { NestFactory } from "@nestjs/core"
import throng from "throng"
// illustrative import path for our worker module
import { WorkerMainModule } from "./worker-main.module"

// Each throng worker bootstraps its own Nest application context.
const bootstrap = async () => {
    const app = await NestFactory.create(WorkerMainModule)
}

// Spawn one worker process per WEB_CONCURRENCY (set by Heroku per dyno size).
throng({ workers: process.env.WEB_CONCURRENCY, worker: bootstrap })
```