Let me preface the question by saying that I understand there will probably not be a definitive, yes-or-no answer on this topic and that answers might be opinion driven. However, I would be thankful for any advice and/or guidance on deploying and operating the following API design.
What I am working on
For a SaaS of mine, I would like to provide its functionality to my customers via an API.
The task the SaaS performs is a long-running, computationally expensive one. So, unfortunately, a straightforward, "synchronous" API where the caller waits for the result of this task to be delivered as the response to their request is not suitable.
Instead, I have settled on an approach where the caller schedules Jobs and periodically queries the API to see if a given job has finished yet. I'm quite happy with that design.
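For illustration, the intended client flow might look like the following sketch. It assumes Node 18+ for the global fetch API; the base URL and the 5-second polling interval are placeholders, and the status endpoint it polls is the hypothetical one sketched further below:

```ts
const base = "https://api.example.com" // placeholder base URL

// Schedule a job and remember its id from the response.
const created = await fetch(`${base}/job`, { method: "POST" })
const { job: jobId } = await created.json()

// Poll the (hypothetical) status endpoint until the job is done.
for (;;) {
    const res = await fetch(`${base}/job/${jobId}`)
    const status = await res.json()
    if (status.state === "completed") break
    await new Promise((resolve) => setTimeout(resolve, 5000))
}
```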
What I have done so far
For the implementation, I have built on / used the following technologies:
- Node.js server
- Express npm package for API and routing
- Redis database for job scheduling
- bullmq npm package for Redis connection and queue management
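For reference, both Express and bullmq come from npm, and bullmq additionally needs a reachable Redis instance (shown here via Docker as one option; adapt as needed):

```bash
npm install express bullmq
# bullmq connects to a Redis instance; a local one can be started e.g. with Docker
docker run -d -p 6379:6379 redis
```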
Using bullmq, I create a Queue and add Jobs to that queue in response to a call to a given endpoint.
For me, this is in api.ts (shortened):
```ts
import Express from "express"
import { Queue } from "bullmq"

const api = Express()
const queue = new Queue("com.mysaas.workerQueue")

api.post("/job", async (request, response) => {
    var data, jobId
    ...
    await queue.add("com.mysaas.defaultJob", data, {
        jobId: jobId,
        removeOnComplete: true,
        ...
    })
    response.send({
        status: "success",
        job: jobId
    })
})
...
```
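The polling side is not shown above; a minimal sketch of such a status endpoint, using bullmq's Queue.getJob() and Job.getState(), could look roughly like this (the route shape and response fields are illustrative, not part of my actual code):

```ts
// Hypothetical status endpoint for the polling part (not in my original api.ts).
// Note: with removeOnComplete: true (as above), a finished job is deleted from
// Redis, so a missing job can mean either "unknown id" or "completed and cleaned up".
api.get("/job/:id", async (request, response) => {
    const job = await queue.getJob(request.params.id)
    if (!job) {
        response.status(404).send({ status: "unknown" })
        return
    }
    const state = await job.getState() // e.g. "waiting", "active", "completed", "failed"
    response.send({
        status: "success",
        job: job.id,
        state: state,
        result: job.returnvalue ?? null
    })
})
```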
Additionally, I create workers which process scheduled jobs, coordinated by the bullmq package, using Redis under the hood.

For me, this is in worker.ts (shortened):
```ts
import { Worker } from "bullmq"

const worker = new Worker("com.mysaas.workerQueue", async (job) => {
    await someWork()
    ...
}, { concurrency: 25 })
```
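One detail that matters for the polling design: whatever the processor function returns is persisted by bullmq as the job's return value (job.returnvalue), which a status endpoint like the sketch above can hand back to the caller. A variant of the processor, with a hypothetical computeResult() standing in for someWork():

```ts
import { Worker } from "bullmq"

// Returning a value from the processor stores it on the completed job.
const resultWorker = new Worker("com.mysaas.workerQueue", async (job) => {
    const result = await computeResult(job.data) // hypothetical stand-in for someWork()
    return result // available later as job.returnvalue
}, { concurrency: 25 })
```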
How I run it currently in my dev environment
I use tsc to compile TypeScript to JavaScript.
I use pm2 as a process manager and daemon. To run the API, I use the following command:

```bash
pm2 start -i max build/api/api.js
```

This will start one Node process for each available CPU core (8 on my dev machine).
To run a worker, I open another terminal and execute:

```bash
node build/worker/worker.js
```

I can add workers by opening more terminals and repeating the command above.
Doing so will cause multiple workers to take Jobs from the Queue, sharing the workload and finishing multiple jobs concurrently and thus more quickly.
All of this works quite nicely in my dev environment.
What I am unsure about
What I don't know is if the current approach is suitable for the production environment. I have studied the documentation for pm2
and bullmq
but I can't seem to find a definitive description of how to combine those two. My target being, of course, maximising performance and API throughput.
The following points are still open issues in my head:
- Will pm2 actually run one process on each available core, i.e. assign a certain core to a given process? Or will it rather spawn as many processes as there are cores and leave it to the lower-level system mechanics to load-balance which core's computation time is used for which process?
- If the former, will child processes I launch from an instance of my API process be assigned to the core the parent process is running on?
- When launching a worker as mentioned above, we can provide a concurrency option, which tells bullmq how many jobs that worker may process concurrently. Is that concurrency also limited to the core assigned to the initial worker process (started with node build/worker/worker.js)?
Given my uncertainty about the above points, I have distilled three approaches to running the processes in the production environment:

1. Start API processes and daemonise with pm2 start -i max build/api/api.js. Start worker processes and daemonise with pm2 start -i max build/worker/worker.js. This would leave me with 1 API process and 1 worker process per available core. I would assume this approach optimises CPU load. However, I have no idea what value to assign to the concurrency parameter of the bullmq worker then. (A pm2 ecosystem file equivalent to these commands is sketched after this list.)
2. Move/import the worker code into the API process. Start API processes and daemonise with pm2 start -i max build/api/api.js. This would leave me with 1 process per core, containing both API and worker. The downside here seems to be that worker and API share a process, meaning that if the API part causes the process to crash for some reason, the worker would shut down with it. Is that correct?
3. Start API processes and daemonise with pm2 start -i max build/api/api.js. Start and daemonise a single worker process with a high value (100+) for the concurrency parameter. Assuming the concurrent work gets load-balanced to the available cores by the system, this should also result in optimised CPU utilisation. The downside being that if the single worker process crashes, no work is done until it is restored.
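As a concrete sketch of approach 1, the two pm2 commands can equivalently be expressed in pm2's ecosystem file format (the file itself is illustrative; what value concurrency should get per worker remains the open question):

```js
// ecosystem.config.js — illustrative pm2 config equivalent to approach 1,
// started with: pm2 start ecosystem.config.js
module.exports = {
    apps: [
        {
            name: "api",
            script: "build/api/api.js",
            instances: "max",    // one instance per available core
            exec_mode: "cluster" // -i max implies cluster mode
        },
        {
            name: "worker",
            script: "build/worker/worker.js",
            instances: "max",
            exec_mode: "cluster"
        }
    ]
}
```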
My gut feeling tells me approach 1 is best in my scenario. However, this is the first time I am deploying a Node.js application of this sort to production. Therefore, I sincerely appreciate any help, advice, guidance or feedback on the described project.
Update 03/2022
We ended up going for a much more bootstrapped approach than deploying everything by hand with pm2, etc.

The overall API concept stayed the same (schedule job, poll job status), but we switched to nestjs as the main driver and central API framework. The nest app is split into api and worker modules. Both get deployed to Heroku as separately scalable processes (Heroku dynos). We ended up using the Heroku Redis add-on for job scheduling with bullmq.
To be honest, after making the switch to Heroku, the only thing we did with regard to the worker concurrency and CPU core questions posed above was to utilise the Heroku-provided WEB_CONCURRENCY environment variable to spawn the appropriate number of worker processes on each dyno.
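For completeness, the split into separately scalable dynos is declared in the app's Procfile; the build output paths here are illustrative:

```
web: node dist/api/main.js
worker: node dist/worker/main.js
```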
The code really ended up as simple as this (using the throng package in our entry-point .js file):
```ts
import { NestFactory } from "@nestjs/core"
import throng from "throng"
// illustrative import path for our worker module
import { WorkerMainModule } from "./worker-main.module"

// Each throng worker bootstraps its own Nest application context.
const bootstrap = async () => {
    const app = await NestFactory.create(WorkerMainModule)
}

// Spawn one worker process per WEB_CONCURRENCY (set by Heroku per dyno size).
throng({ workers: process.env.WEB_CONCURRENCY, worker: bootstrap })
```