In my Node.js app I'm using the cluster module to utilize my multi-core CPU, and the node-mariasql library to communicate with my database. Since node-mariasql
does not support pooling, I'm using the third-party generic-pool library to maintain a pool of connections.
I've noticed that our CPU usage goes up to 100% whenever a connection on the master thread is closed after an uncaught exception causes one of the child workers to restart.
Whenever a child worker restarts, I destroy all MySQL connections.
Node version - v4.2.2
MariaDB version - v10.0.15
node-mariasql version - v0.2.5
Reproducible code - https://github.com/bsurendrakumar/node-simplex/
Code snippets
Creating the pool of connections ...
var pool = poolModule.Pool({
  name: 'mariadb',
  create: function(callback) {
    var client = new mSQLClient();
    client.connect(dbConfig);
    client.on('error', function(err) {
      callback(err, null);
    });
    client.on('ready', function() {
      callback(null, client);
    });
  },
  destroy: function(client) {
    if(cluster.isMaster) {
      console.log('Destroying / ending master thread ID -', client.threadId);
    }
    if(isDraining) {
      client.destroy();
    } else {
      client.end();
    }
  },
  max: dbConfig.maxConn,
  min: dbConfig.minConn,
  idleTimeoutMillis: dbConfig.idleTimeout
});
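Elsewhere in the code, connections are checked out of this pool with generic-pool's acquire() / release(). The following is a minimal usage sketch - the query text and callback shape are illustrative, not the repo's actual model code -
function runQuery(sql, cbMain) {
  pool.acquire(function(err, client) {
    if(err) {
      return cbMain(err);
    }
    var rows = [];
    var queryErr = null;
    client.query(sql)
      .on('result', function(res) {
        res.on('row', function(row) {
          rows.push(row);
        });
        res.on('error', function(e) {
          queryErr = e;
        });
      })
      .on('end', function() {
        // Hand the client back to the pool only once the query has fully finished.
        pool.release(client);
        cbMain(queryErr, rows);
      });
  });
}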
On the master thread ...
console.log('------------------------------------');
console.log('Master Process ID:', process.pid);
console.log('------------------------------------\n\n');
console.log('Creating an extra DB connection on the master thread.\n\n');
getCountries();
// Create a worker for each CPU
for (var i = 0; i < cpuCount; i += 1) {
  cluster.fork();
}
// Restarting the thread if something exits...
cluster.on('exit', function () {
  cluster.fork();
});
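For reference, the worker side of the fork essentially just runs the HTTP server. A stripped-down sketch is below - port 3000 is taken from the crash URL further down, and the request handling is a placeholder rather than the repo's actual routing -
var http = require('http');
var httpServer = null;

if(cluster.isWorker) {
  httpServer = http.createServer(function(req, res) {
    // Real routing / API handling lives in the repo; this is just illustrative.
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ pid: process.pid }));
  });
  httpServer.listen(3000, function() {
    console.log('Worker', process.pid, 'listening on port 3000');
  });
}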
Once an uncaught exception happens...
// Handle uncaught exceptions...
process.on('uncaughtException', function (err) {
  try {
    console.log('\n--------------');
    console.log(err);
    // Stop the HTTP Server
    console.log('\n--------------');
    console.log('Encountered uncaught exception!');
    console.log('Stopping HTTP server ...');
    if(httpServer) {
      httpServer.close();
    }
    console.log('Stopped HTTP server, performing cleanup ...');
    // Call the cleanup function
    cleanUp(function() {
      // Exit!!
      console.log('Cleanup done!');
      restartProcess();
    });
  } catch (e) {
    console.log(e);
    restartProcess();
  }

  function restartProcess() {
    console.log('Restarting process ...');
    process.exit(1);
  }
});
The cleanup function ...
function cleanUp(cbMain) {
  isDraining = true;
  // Destroy and release any connections that are still checked out of the pool.
  if(pool.hasOwnProperty('_inUseObjects')
    && Array.isArray(pool._inUseObjects)
    && pool._inUseObjects.length > 0) {
    let inUseObjs = pool._inUseObjects;
    let inUseObjsLen = pool._inUseObjects.length;
    for(let i = 0; i !== inUseObjsLen; ++i) {
      inUseObjs[0].destroy();
      pool.release(inUseObjs[0]);
    }
  }
  // Drain the pool, then destroy the remaining idle connections.
  pool.drain(function() {
    pool.destroyAllNow(function() {
      return cbMain();
    });
  });
}
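As an aside, _inUseObjects is a private field of generic-pool, so the loop above depends on its internals. A less fragile variant (just a sketch - the helper names here are made up, not from the repo) would be to track checked-out clients explicitly and destroy those during cleanup -
var acquiredClients = [];

function acquireClient(cb) {
  pool.acquire(function(err, client) {
    if(err) {
      return cb(err);
    }
    acquiredClients.push(client);
    cb(null, client);
  });
}

function releaseClient(client) {
  var idx = acquiredClients.indexOf(client);
  if(idx !== -1) {
    acquiredClients.splice(idx, 1);
  }
  pool.release(client);
}

// During cleanup, destroy whatever is still checked out, then release it.
function destroyAcquiredClients() {
  acquiredClients.slice().forEach(function(client) {
    client.destroy();
    releaseClient(client);
  });
}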
The minimum number of connections in the pool is set to 5. All of its configuration can be found here. So when the server starts, generic-pool will kick off 5 connections to MySQL and keep them in its pool.
The idleTimeout for an object in the pool has been set to 120 seconds. This means that if there are more than 5 objects in the pool (since 5 is the minimum) and one of them has not been used for the last 120 seconds, it will be destroyed.
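For reference, the pool-related parts of dbConfig look roughly like this - the values below are reconstructed from the description above, and maxConn is just a placeholder (the real values are in the repo's config) -
var dbConfig = {
  // mariasql connection options (host, user, password, db) omitted here
  minConn: 5,              // minimum number of pooled connections
  maxConn: 10,             // placeholder - see the repo for the actual maximum
  idleTimeout: 120 * 1000  // 120 seconds, passed to generic-pool as idleTimeoutMillis
};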
At server startup, I make a simple call to our country model to fetch the list of countries. This code is here. It establishes a new connection to the database, so the pool now holds 6 SQL connections, one of which will get cleaned up after 120 seconds.
Following is the step-by-step process through which I came to believe that the issue lies with our usage of the mariasql library -
- When the server is started, I log the process ID to the console. Grab the master process ID, for example - 20584.
- Look at the file descriptors being used by the process using ls -l /proc/20584/fd. Make a note of the socket connections. The output will look something like this -
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 12 -> socket:[2469914]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 13 -> socket:[2469917]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 14 -> socket:[2468106]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 15 -> socket:[2468109]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 17 -> socket:[2467206]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 18 -> socket:[2467208]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 19 -> socket:[2467210]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 2 -> /dev/tty
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 20 -> socket:[2467212]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 21 -> socket:[2467214]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 22 -> socket:[2467306]
- Copy a few of the socket numbers, for example 2467212, and run lsof | grep 2467212. You'll notice that these are connections to the MySQL server. The output should look something like this -
node 20584 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
V8 20584 20585 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
V8 20584 20586 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
V8 20584 20587 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
V8 20584 20588 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
- Crash the server by going to http://127.0.0.1:3000/api/v1/country/list. This will crash one of the child processes. Whenever an uncaught exception occurs, I do some cleanup and exit. I then fork another process to take the place of the one that was just killed. Cleanup includes -
- Closing the HTTP server
- Closing the MySQL connections in the generic pool
- Closing the winston logger streams
- Wait for the MySQL connection in the master thread to be closed. When this happens, I write a log to the console -
Destroying / ending master thread ID - 4984
- Check your CPU usage; you'll notice that one of the CPUs has shot up to 100%.
- Next, run strace -o log.txt -e epoll_ctl,epoll_wait -p 20584. Note that you might need to install strace. This command logs all the epoll_ctl and epoll_wait system calls made by the Node.js process and puts them into a file named log.txt in the current working directory. Open the log.txt file and you'll notice logs similar to these -
epoll_wait(5, {{EPOLLIN|EPOLLHUP, {u32=16, u64=16}}}, 1024, 847) = 1
epoll_ctl(5, EPOLL_CTL_DEL, 16, 7ffe441aa850) = -1 EBADF (Bad file descriptor)
epoll_wait(5, {{EPOLLIN|EPOLLHUP, {u32=16, u64=16}}}, 1024, 845) = 1
epoll_ctl(5, EPOLL_CTL_DEL, 16, 7ffe441aa850) = -1 EBADF (Bad file descriptor)
epoll_wait(5, {{EPOLLIN|EPOLLHUP, {u32=16, u64=16}}}, 1024, 843) = 1
epoll_ctl(5, EPOLL_CTL_DEL, 16, 7ffe441aa850) = -1 EBADF (Bad file descriptor)
The file descriptor here is 16, and if you correlate it with your earlier ls -l /proc/20584/fd and lsof | grep 2467212 output, you'll realize that it belongs to the MySQL connection that was just closed.
This leads me to believe that even after the MySQL connection is released, a file descriptor is left hanging around and is still being polled - epoll_wait keeps reporting EPOLLHUP for it, the epoll_ctl call to remove it fails with EBADF, and the event loop ends up spinning, which would explain the 100% CPU usage. I've found various threads across forums with similar issues -