Why does using taskset to run a multi-threaded Linux program on a set of isolated cores cause all threads to run on one core?

Desired behaviour: run a multi-threaded Linux program on a set of cores which have been isolated using isolcpus.

Here's a small program we can use as an example multi-threaded program:

#include <stdio.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>

#define NTHR    16
#define TIME    (60 * 5)    /* seconds to keep the process alive */

/* Thread body: wake up every 10 ms and do a trivial amount of work. */
void *
do_stuff(void *arg)
{
    int i = 0;

    (void) arg;
    while (1) {
        i += i;
        usleep(10000); /* don't dominate the CPU */
    }
}

int
main(void)
{
    pthread_t   threads[NTHR];
    int     rv, i;

    for (i = 0; i < NTHR; i++) {
        rv = pthread_create(&threads[i], NULL, do_stuff, NULL);
        if (rv) {
            /* pthread_create() returns the error number; it does not set errno */
            fprintf(stderr, "pthread_create: %s\n", strerror(rv));
            return (EXIT_FAILURE);
        }
    }
    sleep(TIME);
    exit(EXIT_SUCCESS);
}

If I compile and run this on a kernel with no isolated CPUs, then the threads are spread out over my 4 CPUs. Good!

Now if I add isolcpus=2,3 to the kernel command line and reboot:

Running the program without taskset distributes threads over cores 0 and 1. This is expected as the default affinity mask now excludes cores 2 and 3.
Running with taskset -c 0,1 has the same effect. Good.
Running with taskset -c 2,3 causes all threads to go onto the same core (either core 2 or 3). This is undesired. Threads should distribute over cores 2 and 3. Right?
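To check what the scheduler is actually allowed to do in each of these cases, it can help to print the affinity mask from inside the process. Here is a minimal sketch, assuming glibc on Linux; the helper names allowed_cpus and print_placement are mine and not part of any API:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/*
 * Store the CPU numbers in the calling thread's affinity mask into
 * cpus[] (at most max entries). Returns the number of CPUs in the
 * mask, or -1 if sched_getaffinity() fails.
 */
int
allowed_cpus(int *cpus, int max)
{
    cpu_set_t   set;
    int         cpu, n = 0;

    assert(cpus != NULL && max > 0);
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return (-1);
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &set))
            continue;
        if (n < max)
            cpus[n] = cpu;
        n++;
    }
    return (n);
}

/* Print the mask and the CPU the caller is running on right now. */
void
print_placement(void)
{
    int cpus[CPU_SETSIZE], n, i;

    n = allowed_cpus(cpus, CPU_SETSIZE);
    printf("allowed:");
    for (i = 0; i < n; i++)
        printf(" %d", cpus[i]);
    printf("\nrunning on: %d\n", sched_getcpu());
}
```

Calling print_placement() at the top of do_stuff should show whether every thread really has both CPUs 2 and 3 in its mask under taskset -c 2,3 while still reporting the same "running on" value.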

This post describes a similar issue (although the example given is further removed from the pthreads API). The OP was happy to work around it by using a different scheduler. I'm not certain this is ideal for my use case, however.

Is there a way to have the threads distributed over the isolated cores using the default scheduler?

Is this a kernel bug which I should report?

EDIT:

The right thing does indeed happen if you use a real-time scheduling policy such as SCHED_FIFO. See man 7 sched and man chrt for details.
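For anyone who would rather switch policy from inside the program than via chrt, here is a rough sketch using sched_setscheduler(). The helper name make_fifo is hypothetical, and the call normally needs root, CAP_SYS_NICE, or a non-zero RLIMIT_RTPRIO:

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>

/*
 * Switch the calling thread to SCHED_FIFO at the given priority.
 * Returns 0 on success or an errno value on failure (typically
 * EPERM when the process lacks the required privilege).
 */
int
make_fifo(int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    assert(prio >= sched_get_priority_min(SCHED_FIFO) &&
        prio <= sched_get_priority_max(SCHED_FIFO));
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return (errno);
    return (0);
}
```

Calling make_fifo(1) in main before the pthread_create loop should be enough for the example program, since threads created with default attributes inherit the creator's scheduling policy. Note that the usleep in do_stuff matters here: a SCHED_FIFO thread that never blocks can starve everything else on its CPU.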

From the Linux Kernel Parameter Doc:

This option can be used to specify one or more CPUs to isolate from the general SMP balancing and scheduling algorithms.

So this option effectively prevents the scheduler from migrating threads from one core to another, less contended core (SMP balancing). Typically, isolcpus is used together with pthread affinity control to pin threads, using knowledge of the CPU layout, to get predictable performance.

https://www.kernel.org/doc/Documentation/kernel-parameters.txt
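The pinning approach mentioned above could look something like the following sketch, which assumes glibc's pthread_setaffinity_np() and spreads threads round-robin over whatever CPUs taskset left in the mask. The helper names nth_allowed_cpu and pin_thread are made up for illustration:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/*
 * Return the idx-th CPU (counting from 0, wrapping around) in the
 * calling thread's affinity mask, or -1 on error.
 */
int
nth_allowed_cpu(int idx)
{
    cpu_set_t   set;
    int         cpu, n;

    assert(idx >= 0);
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return (-1);
    n = CPU_COUNT(&set);
    if (n == 0)
        return (-1);
    idx %= n;
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        /* count down through the CPUs that are set in the mask */
        if (CPU_ISSET(cpu, &set) && idx-- == 0)
            return (cpu);
    }
    return (-1);
}

/* Pin thread t to a single CPU. Returns 0 or an error number. */
int
pin_thread(pthread_t t, int cpu)
{
    cpu_set_t   set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return (pthread_setaffinity_np(t, sizeof(set), &set));
}
```

In the example program this would be one extra call after each successful pthread_create: pin_thread(threads[i], nth_allowed_cpu(i)). Run under taskset -c 2,3, that should alternate the threads between cores 2 and 3 without relying on the load balancer at all, at the cost of the placement being fixed by the program.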

--Edit--

OK, I see why you are confused. Personally, I would also expect consistent behaviour from this option. The difference lies in two functions, select_task_rq_fair and select_task_rq_rt, which are responsible for selecting the new run_queue (essentially, selecting the next_cpu to run on). I did a quick trace (SystemTap) of both functions: for CFS it always returned the same first core in the mask; for RT it returned other cores as well. I haven't had a chance to look into the logic of each selection algorithm, but you could email the maintainers on the Linux kernel development mailing list about a fix.
