High availability computing: How to deal with a non-returning system call, without risking false positives?

Asked 5/5, 2015 at 19:15 Answered 7/5, 2015 at 4:51

Solved linux high-availability failover heartbeat

I have a process that's running on a Linux computer as part of a high-availability system. The process has a main thread that receives requests from the other computers on the network and responds to them. There is also a heartbeat thread that sends out multicast heartbeat packets periodically, to let the other processes on the network know that this process is still alive and available -- if they don't hear any heartbeat packets from it for a while, one of them will assume this process has died and will take over its duties, so that the system as a whole can continue to work.

This all works pretty well, but the other day the entire system failed, and when I investigated why I found the following:

Due to (what is apparently) a bug in the box's Linux kernel, there was a kernel "oops" induced by a system call that this process's main thread made.
Because of the kernel "oops", the system call never returned, leaving the process's main thread permanently hung.
The heartbeat thread, OTOH, continued to operate correctly, which meant that the other nodes on the network never realized that this node had failed, and none of them stepped in to take over its duties... and so the requested tasks were not performed and the system's operation effectively halted.

My question is, is there an elegant solution that can handle this sort of failure? (Obviously one thing to do is fix the Linux kernel so it doesn't "oops", but given the complexity of the Linux kernel, it would be nice if my software could handle other future kernel bugs more gracefully as well).

One solution I don't like would be to put the heartbeat generator into the main thread, rather than running it as a separate thread, or in some other way tie it to the main thread so that if the main thread gets hung up indefinitely, heartbeats won't get sent. The reason I don't like this solution is because the main thread is not a real-time thread, and so doing this would introduce the possibility of occasional false-positives where a slow-to-complete operation was mistaken for a node failure. I'd like to avoid false positives if I can.

Ideally there would be some way to ensure that a failed syscall either returns an error code, or if that's not possible, crashes my process; either of those would halt the generation of heartbeat packets and allow a failover to proceed. Is there any way to do that, or does an unreliable kernel doom my user process to unreliability as well?

Crankpin asked 5/5, 2015 at 19:15 Comment(2)

"introduce the possibility of occasional false-positives where a slow-to-complete operation was mistaken for a node failure" - I'm not a specialist in high-availability computing, so this might be misguided, but that sounds like it should be treated very similarly to a node failure. In particular, it sounds like you'd want the other nodes to start handling its work. – Painterly 5/5, 2015 at 19:21

If all of the work were logically independent, I'd agree, but in this system the node has a particular role to play, and having two nodes playing the same role at the same time is something that would cause confusion, especially when the original node finished its long task only to find some other node trying to take its place. (That situation wouldn't be fatal to the system, but it's something I'd like to avoid unless there is an actual hardware failure) – Crankpin 5/5, 2015 at 19:32

My second suggestion is to use ptrace to find the current instruction pointer. You can have a parent thread that ptraces your process and interrupts it every second to check the current RIP value. This is somewhat complex, so I've written a demonstration program: (x86_64 only, but that should be fixable by changing the register names.)

#define _GNU_SOURCE
#include <unistd.h>
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <linux/ptrace.h>
#include <sys/user.h>
#include <time.h>

// this number is arbitrary - find a better one.
#define STACK_SIZE (1024 * 1024)

int main_thread(void *ptr) {
    // "main" thread is now running under the monitor
    printf("Hello from main!\n");
    while (1) {
        int c = getchar();
        if (c == EOF) { break; }
        nanosleep(&(struct timespec) {0, 200 * 1000 * 1000}, NULL);
        putchar(c);
    }
    return 0;
}

int main(int argc, char *argv[]) {
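    char *vstack = malloc(STACK_SIZE); // char * so the pointer arithmetic below is standard C
    if (vstack == NULL) {
        perror("failed to allocate child stack");
        return 3;
    }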
    pid_t v;
    if (clone(main_thread, vstack + STACK_SIZE, CLONE_PARENT_SETTID | CLONE_FILES | CLONE_FS | CLONE_IO, NULL, &v) == -1) { // you'll want to check these flags
        perror("failed to spawn child task");
        return 3;
    }
    printf("Target: %d; %d\n", v, getpid());
    long ptv = ptrace(PTRACE_SEIZE, v, NULL, NULL);
    if (ptv == -1) {
        perror("failed monitor seize");
        exit(1);
    }
    struct user_regs_struct regs;
    fprintf(stderr, "beginning monitor...\n");
    while (1) {
        sleep(1);
        long ptv = ptrace(PTRACE_INTERRUPT, v, NULL, NULL);
        if (ptv == -1) {
            perror("failed to interrupt main thread");
            break;
        }
        int status;
        if (waitpid(v, &status, __WCLONE) == -1) {
            perror("target wait failed");
            break;
        }
        if (!WIFSTOPPED(status)) { // this section is messy. do it better.
            fputs("target wait went wrong", stderr);
            break;
        }
        if ((status >> 8) != (SIGTRAP | PTRACE_EVENT_STOP << 8)) {
            fputs("target wait went wrong (2)", stderr);
            break;
        }
        ptv = ptrace(PTRACE_GETREGS, v, NULL, &regs);
        if (ptv == -1) {
            perror("failed to peek at registers of thread");
            break;
        }
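        // time_t and the x86_64 register fields are wider than int, so cast and use matching format specifiers
        fprintf(stderr, "%ld -> RIP %llx RSP %llx\n", (long)time(NULL), (unsigned long long)regs.rip, (unsigned long long)regs.rsp);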
        ptv = ptrace(PTRACE_CONT, v, NULL, NULL);
        if (ptv == -1) {
            perror("failed to resume main thread");
            break;
        }
    }
    return 2;
}

Note that this is not production-quality code. You'll need to do a bunch of fixing things up.

Based on this, you should be able to figure out whether or not the program counter is advancing, and could combine this with other pieces of information (such as /proc/PID/status) to find if it's busy in a system call. You might also be able to extend the usage of ptrace to check what system calls are being used, so that you can check if it's a reasonable one to be waiting on.
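As a rough sketch of how the sampled registers might be turned into a "stuck or not" hint: the helper below counts how many consecutive samples showed the same RIP and RSP, and a second helper reads the one-letter task state from /proc/PID/status. The names stall_detector, stall_sample and task_state are made up for illustration, and the interpretation is only a heuristic: a thread legitimately blocked in a long read() also shows an unchanged RIP, so an unchanged sample count plus a 'D' (uninterruptible sleep) state is a hint, not proof.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: remembers the previous register sample. */
struct stall_detector {
    unsigned long long last_rip;
    unsigned long long last_rsp;
    unsigned int unchanged;   /* consecutive samples with no movement */
    bool primed;              /* false until the first sample arrives */
};

/* Feed one (rip, rsp) sample taken by the monitor loop.
 * Returns how many consecutive samples have been identical to the previous one. */
unsigned int stall_sample(struct stall_detector *d,
                          unsigned long long rip, unsigned long long rsp) {
    if (d->primed && d->last_rip == rip && d->last_rsp == rsp) {
        d->unchanged++;
    } else {
        d->unchanged = 0;
    }
    d->last_rip = rip;
    d->last_rsp = rsp;
    d->primed = true;
    return d->unchanged;
}

/* Read the one-letter state (R, S, D, ...) from the "State:" line of
 * /proc/<pid>/status. Returns '?' if the file cannot be read or parsed. */
char task_state(pid_t pid) {
    char path[64], line[256];
    char state = '?';
    snprintf(path, sizeof path, "/proc/%d/status", (int)pid);
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return '?';
    }
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "State:", 6) == 0) {
            sscanf(line + 6, " %c", &state);
            break;
        }
    }
    fclose(f);
    return state;
}
```

In the monitor loop this could be called right after PTRACE_GETREGS with regs.rip and regs.rsp; how many unchanged samples count as "hung" is a policy decision the sketch deliberately leaves open.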

This is a hacky solution, but I don't think that you'll find a non-hacky solution for this problem. Despite the hackiness, I don't think (this is untested) that it would be particularly slow; my implementation pauses the monitored thread once per second for a very short amount of time - which I would guess would be in the 100s of microseconds range. That's around 0.01% efficiency loss, theoretically.

Carrara answered 7/5, 2015 at 4:51 Comment(0)

I think you need a shared activity marker.

Have the main thread (or in a more general application, all worker threads) update the shared activity marker with the current time (or clock tick, e.g. by computing the "current" nanosecond from clock_gettime(CLOCK_MONOTONIC, ...)), and have the heartbeat thread periodically check when this activity marker was last updated, cancelling itself (and thus stopping the heartbeat broadcast) if there has not been any activity update within a reasonable time.

This scheme can easily be extended with a state flag if the workload is very sporadic. The main work thread sets the flag and updates the activity marker when it begins a unit of work, and clears the flag when the work has completed. If there is no work being done then the heartbeat is sent without checking the activity marker. If work is being done then the heartbeat is stopped if the time since the activity marker was updated exceeds the maximum processing time allowed for a unit of work. (Multiple worker threads each need their own activity marker and flag in this case, and the heartbeat thread can be designed to stop when any one worker thread gets stuck, or only when all worker threads get stuck, depending on their purposes and importance to the overall system).

(The activity marker value (and the work flag) will of course have to be protected by a mutex that must be acquired before reading or writing the value.)

Perhaps the heartbeat thread can also cause the whole process to commit suicide (e.g. kill(getpid(), SIGQUIT)) so that it can be restarted by having it be called in a loop in a wrapper script, especially if a process restart clears the condition in the kernel which would cause the problem in the first place.

Extrabold answered 6/5, 2015 at 23:23 Comment(10)

The problem is that I can't give a reliable upper bound on the maximum processing time allowed for a unit of work. – Crankpin 6/5, 2015 at 23:33

You can't??? I think you may need to re-design your whole concept of what it means to know when your main thread is alive. If a unit of work can take an unbounded amount of time then you can never be sure that it will respond to new work requests! – Extrabold 6/5, 2015 at 23:41

It's okay for it to take a long time, if that's what the user intended. What's not okay is for the thread's execution to be halted (e.g. due to a kernel oops) and never complete at all (note: this is not the same thing as crashing, alas). Also not so good is introducing a fix that causes the system to fail (from the user's point of view) if the user's task takes longer than some arbitrarily specified amount of time to complete. – Crankpin 6/5, 2015 at 23:51

You're straying from the scenario you described in your question! Your worker thread is not available while working if it can take an unbounded amount of time to do some unit of work. What are you really asking about? On what basis did you decide the heartbeat interval??? – Extrabold 6/5, 2015 at 23:56

It's okay if subsequent requests have to wait until the first request completes, because due to the nature of the system, only one request can be processed at a time anyway. What I'm trying to avoid is the scenario described in my question, where a kernel oops causes the processing to halt permanently, but that failure does not cause a crash, and thus the failure is not detected and no failover to the backup hardware occurs. I'd like to do that without adding timeouts, if possible, since adding a timeout means there's a risk of making the timeout too short or too long. – Crankpin 7/5, 2015 at 1:8

Well any non-infinite timeout is shorter than an infinite hang..... How would /should the user detect the hang now? What is this stuck system call? How many, and what, system calls does the main thread do while processing one unit of work? How long should the system call which gets stuck be expected to take normally? (BTW, when I said "unit of work" I did not necessarily mean "the whole job" for a given request.) – Extrabold 7/5, 2015 at 1:18

Is it at all possible to dynamically determine the upper bound on the time based on the task being executed? – Carrara 7/5, 2015 at 1:28

col6y I could measure it empirically, but that only gives me the time for that particular task, under that particular load, on that particular machine, at that particular time. That might be "good enough" if I then multiply the measurements by 5 or something, but it wouldn't give me a whole lot of confidence since it's not a real-time task and the tasks will likely be different in the future anyway. – Crankpin 7/5, 2015 at 2:13

Greg the user can't easily detect the hang now; all they can do is notice that the thread is not responding, and is at 0% CPU usage, and if they read the kernel log they will see the "oops" message at the same time the thread stopped functioning. It's not clear which system call is the culprit, but I think it is a disk I/O call as the stack trace of the 'oops' implicated the filesystem. – Crankpin 7/5, 2015 at 2:17

#1 priority -- SIGQUIT the process and get a core dump and find out where each thread is sitting!!! (compile with -g, and ideally with -O0 first!) #2 think about whether an inner processing loop in the work thread can tickle an activity timer. – Extrabold 7/5, 2015 at 2:27

One possible method would be to have another set of heartbeat messages from the main thread to the heartbeat thread. If it stops receiving messages for a certain amount of time, it stops sending them out as well. (And could try other recovery such as restarting the process.)

To solve the issue of the main thread actually just being in a long sleep, have a (properly-synchronized) flag that the heartbeat thread sets when it has decided that the main thread must have failed - and the main thread should check this flag at appropriate times (e.g. after the potential wait) to make sure that it hasn't been reported as dead. If it has, it stops running, because its job would have already been taken up by a different node.

The main thread can also send I-am-alive events to the heartbeat thread at other times than once around the loop - for example, if it's going into a long-running operation. Without this, there's no way to tell the difference between a failed main thread and a sleeping main thread.
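One way this internal heartbeat could look, sketched with a pipe and poll(): the main thread writes a byte whenever it is alive, the heartbeat thread waits for a byte with a timeout, and an atomic flag records that the main thread has been declared dead. All names here (liveness, notify_alive, wait_alive, was_declared_dead) are invented for the example, and the timeout is left to the caller.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

struct liveness {
    int fds[2];                 /* pipe: main thread writes, heartbeat thread reads */
    atomic_bool declared_dead;  /* set once the heartbeat thread gives up */
};

int liveness_init(struct liveness *l) {
    if (pipe(l->fds) == -1) {
        return -1;
    }
    /* Non-blocking write end, so a full pipe can never stall the main thread. */
    int fl = fcntl(l->fds[1], F_GETFL);
    if (fl == -1 || fcntl(l->fds[1], F_SETFL, fl | O_NONBLOCK) == -1) {
        return -1;
    }
    atomic_init(&l->declared_dead, false);
    return 0;
}

/* Main thread: call once per request loop and around long operations. */
void notify_alive(struct liveness *l) {
    char b = 1;
    ssize_t n = write(l->fds[1], &b, 1);
    (void)n;  /* a full pipe (EAGAIN) already means messages are pending */
}

/* Heartbeat thread: wait up to timeout_ms for an I-am-alive message.
 * Returns true if one arrived; otherwise marks the main thread as dead. */
bool wait_alive(struct liveness *l, int timeout_ms) {
    struct pollfd p = { .fd = l->fds[0], .events = POLLIN };
    int r;
    do {
        r = poll(&p, 1, timeout_ms);  /* restarts the full timeout on EINTR */
    } while (r == -1 && errno == EINTR);
    if (r > 0 && (p.revents & POLLIN)) {
        char buf[64];
        ssize_t n = read(l->fds[0], buf, sizeof buf);  /* drain queued messages */
        (void)n;
        return true;
    }
    atomic_store(&l->declared_dead, true);
    return false;
}

/* Main thread: check after any potentially long wait. If this returns true,
 * another node may already have taken over, so the thread should stop. */
bool was_declared_dead(struct liveness *l) {
    return atomic_load(&l->declared_dead);
}
```

The heartbeat thread would only send its multicast packet when wait_alive() returns true. The timeout still has to be chosen, which is the weak point of this approach.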

Carrara answered 6/5, 2015 at 5:44 Comment(6)

The problem I see with this approach is that it's a bit impractical to insert SendIAmAlive() calls into every possible routine the program might ever call (some of which are in third-party code, etc) – Crankpin 6/5, 2015 at 15:49

It doesn't have to go everywhere - just where it matters. I would assume that if your main process is handling network requests, there would be an obvious place each time around the loop where it waits for another request. This would be the obvious place to stick the NotifyIAmAlive() call. The only time it has to go elsewhere is if you have some really long operations elsewhere in the loop. – Carrara 6/5, 2015 at 19:12

The thing is, it's not always obvious where the really long operations will be, nor are they necessarily going to be in code that I am able to modify. Finding out that I missed a spot, in the form of having a customer call me and tell me that my "super-reliable" system indicates a hardware failure whenever they try to do operation X is not something I want to risk. – Crankpin 6/5, 2015 at 19:58

Hmm... so I suppose what you really want is a way to tell the difference between a crashed syscall and a really-long-running procedure call. Unfortunately, I don't think a mechanism for this exists (at least on the syscall level), which means that all I think you can do is find a way to guess about the nature of the pause. – Carrara 6/5, 2015 at 21:14

Maybe what I need is a way for the heartbeat thread to observe the main thread's program-counter register, to see if it ever moves... hmm, sounds tricky :) – Crankpin 6/5, 2015 at 22:12

You could use ptrace? But that could be slow. – Carrara 7/5, 2015 at 1:29
