Kernel still schedules code to run on isolated cores

I have a system running Linux kernel 4.19.71 on an Intel Xeon Platinum 8160 CPU, which has 24 physical cores and, with 2 threads per core, 48 logical cores. I'm experimenting with virtualization (QEMU and KVM) and would like to isolate a set of cores from the OS and hypervisor so that they run application code exclusively. So I added the isolcpus= kernel boot parameter:

isolcpus=1-23,25-47
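
One way to double-check that the parameter took effect is to compare it with what the running kernel reports. The following is only a sketch: expand_cpulist is a hypothetical helper (not part of any standard tool) that expands a kernel-style CPU list, and it assumes the kernel exposes the isolated set in /sys/devices/system/cpu/isolated, which may not be the case on every kernel.

```shell
# expand_cpulist LIST: expand a kernel CPU list such as "1-3,5" into "1 2 3 5".
# Hypothetical helper for illustration; plain POSIX sh, no external tools.
expand_cpulist() {
    out=""
    old_ifs=$IFS
    IFS=,
    for part in $1; do              # split the list on commas
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-} ;;   # a range "lo-hi"
            *)   lo=$part; hi=$part ;;             # a single CPU
        esac
        i=$lo
        while [ "$i" -le "$hi" ]; do
            out="$out${out:+ }$i"
            i=$((i + 1))
        done
    done
    IFS=$old_ifs
    printf '%s\n' "$out"
}

# Example (assumes the sysfs file exists on your kernel):
#   expand_cpulist "$(cat /sys/devices/system/cpu/isolated)"
```

With the boot parameter above, the expanded list should contain 46 CPUs (1 to 23 and 25 to 47); /proc/cmdline shows the parameters the kernel was actually booted with.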

However, I still see some kernel threads scheduled on the cores I'm isolating, e.g.:

# ps -A -L -o pid,nlwp,tid,c,psr,comm |sort -n -k 5 | grep 27
  148    1   148  0  27 kworker/27:0-mm_percpu_wq
  149    1   149  0  27 kworker/27:0H-events_highpri
  267    1   267  0  27 kworker/27:1-mm_percpu_wq
  799    1   799  0  27 kworker/27:1H-events_highpri
...
#

The fifth column (psr) is the processor (core) ID, in this case 27, which according to the isolcpus= setting above should not be disturbed by the kernel, yet kworker threads run there.
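
Note that a plain grep 27 matches any line containing "27", including PIDs or thread names, not just the psr column. A stricter filter could look like the sketch below; tasks_on_cpu is a hypothetical helper name, and it assumes input lines in the order pid, tid, psr, comm.

```shell
# tasks_on_cpu CPU: read "pid tid psr comm" lines on stdin and print only
# those whose psr (third field) equals CPU exactly.
# Hypothetical helper for illustration.
tasks_on_cpu() {
    awk -v cpu="$1" '$3 == cpu'
}

# Example (the trailing "=" suppresses the ps column headers):
#   ps -e -L -o pid=,tid=,psr=,comm= | tasks_on_cpu 27
```

If taskset is installed, taskset -cp with one of the listed PIDs shows the CPU affinity of that thread, which tells you whether it is pinned to the core or merely happened to run there.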

Does this mean there are exceptions and the kernel is still allowed to schedule tasks on the isolated cores, or am I missing something obvious?

Thanks.

Container answered 15/3, 2021 at 22:43

I am also working on this issue and haven't figured out a way to prevent those kernel threads from being scheduled on the isolated CPU set.

According to Red Hat's documentation, it doesn't seem to be feasible either:

Isolating CPUs
You can isolate one or more CPUs from the scheduler with the isolcpus boot parameter. This prevents the scheduler from scheduling any user-space threads on this CPU.

I have been using a combination of isolcpus and cset shield to keep most of the kernel's housekeeping threads off my isolated CPUs.

I used perf sched to record the context switches on my CPUs and perf sched map to visualize them.

In the first experiment, I used only cset shield.

$ grep -e '=>' exp_1.sch
        *A0 445210.783227 secs A0 => kworker/11:1-ev:165
        *.  445210.783275 secs .  => swapper:0
        *B0 445210.783304 secs B0 => kworker/u24:4-e:130904
        *C0 445210.783420 secs C0 => WORKER2:160974
.   *D0  C0 445210.783844 secs D0 => kworker/10:0-ev:1672
*E0  .   C0 445210.784703 secs E0 => WORKER0:160969
*F0  .   C0 445210.789628 secs F0 => kworker/9:1-eve:163
 E0 *G0  .  445210.802886 secs G0 => WORKER1:160973
 E0 *H0  .  445210.811638 secs H0 => ksoftirqd/10:76
 E0 *I0  .  445210.939469 secs I0 => kworker/u24:2-e:158157
*J0  G0  .  445211.527639 secs J0 => ksoftirqd/9:70
 E0  G0 *K0 445212.087622 secs K0 => ksoftirqd/11:82
 E0 *L0  .  445212.347277 secs L0 => kworker/10:1H-k:277
*M0  I0  C0 445213.321971 secs M0 => kworker/u24:1-e:160121
 E0 *N0  .  445214.463593 secs N0 => migration/10:75
*O0  N0  .  445214.463597 secs O0 => migration/9:69
 O0  N0 *P0 445214.463598 secs P0 => migration/11:81
*Q0  G0  M0 445225.372366 secs Q0 => kworker/9:1H-kb:330

Here you can see my workload threads (WORKER{0,1,2}), the kworker threads (kworker/{9,10,11}:) bound to CPUs 9-11, and the remaining tasks: ksoftirqd/{9,10,11}:, migration/{9,10,11}:, kworker/u24 and the "idle" thread swapper.
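
The thread names follow a kernel naming convention that hints at which of them can be moved at all: kworker/N:M, ksoftirqd/N and migration/N are per-CPU threads tied to CPU N, while kworker/uN:M workers are unbound. As far as I know, the per-CPU ones cannot be migrated to another core. A minimal sketch of that classification, where classify_kthread is a hypothetical helper that looks only at the name:

```shell
# classify_kthread NAME: guess from the thread name whether a kernel thread
# is tied to one CPU or unbound. Hypothetical helper; it relies purely on
# the naming convention and does not inspect the thread itself.
classify_kthread() {
    case $1 in
        kworker/u*)
            echo unbound ;;      # e.g. kworker/u24:4
        kworker/[0-9]*|ksoftirqd/[0-9]*|migration/[0-9]*)
            echo per-cpu ;;      # e.g. kworker/11:1, ksoftirqd/10
        *)
            echo other ;;        # user tasks, swapper, ...
    esac
}
```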

In the second experiment, I used cset shield with isolcpus.

$ grep -e '=>' exp_2.sch
*A0            1033.342241 secs A0 => WORKER0:3646
 A0     *B0    1033.342675 secs B0 => kworker/11:1-ev:165
 A0     *.     1033.342694 secs .  => swapper:0
 A0 *C0  .     1033.343470 secs C0 => WORKER1:3647
 A0  C0 *D0    1033.344634 secs D0 => WORKER2:3648
 A0 *E0  D0    1033.346306 secs E0 => kworker/10:1-ev:164
*F0  .   D0    1033.364736 secs F0 => kworker/9:1-eve:163
 A0 *G0  .     1036.433541 secs G0 => migration/10:75
*H0  G0  .     1036.433541 secs H0 => migration/9:69
 A0  G0 *I0    1036.433548 secs I0 => migration/11:81

In this case, only the WORKER{0,1,2}, kworker/{9,10,11}, migration/{9,10,11} and swapper tasks appear; the ksoftirqd and kworker/u24 threads no longer show up.
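
To compare the two runs without reading the maps by eye, the distinct task names can be pulled out of each file. This is just a sketch: map_tasks is a hypothetical helper, and it assumes every line of interest ends in "=> name:pid" as in the outputs above.

```shell
# map_tasks: read `perf sched map` output on stdin and print the distinct
# task names that were switched in, i.e. the text after "=>" with the
# trailing ":pid" removed. Hypothetical helper for illustration.
map_tasks() {
    awk -F'=> ' '/=>/ {
        name = $2
        sub(/[ \t]+$/, "", name)    # drop trailing whitespace
        sub(/:[0-9]+$/, "", name)   # drop the ":pid" suffix
        print name
    }' | LC_ALL=C sort -u
}

# Example:
#   map_tasks < exp_1.sch > exp_1.tasks
#   map_tasks < exp_2.sch > exp_2.tasks
#   diff exp_1.tasks exp_2.tasks
```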

Azal answered 8/2, 2023 at 11:6 Comment(1)
I think some things need to run on all cores. RCU, for example, needs to run on every core, since kernel code on the isolated CPU could still have been reading an RCU-protected kernel variable. So this might be the best you can do. - Schonthal
