While debugging a performance issue in an app I'm working on, I noticed some odd behaviour of the kernel scheduler. Busy SCHED_FIFO tasks tend to be scheduled on logical cores of the same physical CPU even though there are idle physical CPUs in the system.
8624 root -81 0 97.0g 49g 326m R 100 52.7 48:13.06 26 Worker0 <-- CPU 6 and 26
8629 root -81 0 97.0g 49g 326m R 100 52.7 44:56.26 6 Worker5 <-- the same physical core
8625 root -81 0 97.0g 49g 326m R 82 52.7 58:20.65 23 Worker1
8627 root -81 0 97.0g 49g 326m R 67 52.7 55:28.86 27 Worker3
8626 root -81 0 97.0g 49g 326m R 67 52.7 46:04.55 32 Worker2
8628 root -81 0 97.0g 49g 326m R 59 52.7 44:23.11 5 Worker4
Initially the threads shuffle between cores, but at some point the most CPU-intensive threads end up locked on the same physical core and don't seem to move from there. No CPU affinity is set for the Worker threads.
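For reference, the sibling pairing can be verified from sysfs (Linux-specific, using the standard CPU topology interface):

```shell
# Print each logical CPU together with its SMT sibling set, to confirm
# which CPU numbers (e.g. 6 and 26 above) share one physical core.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    printf '%s: ' "${cpu##*/}"
    cat "$cpu/topology/thread_siblings_list"
done | sort -V
```

On this box that prints e.g. `cpu6: 6,26`, which is how I matched the Worker threads to physical cores.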
I tried to reproduce it with a synthetic load by running 12 instances of:
chrt -f 10 yes > /dev/null &
And here is what I got:
25668 root -11 0 2876 752 656 R 100 0.0 0:17.86 20 yes
25663 root -11 0 2876 744 656 R 100 0.0 0:19.10 25 yes
25664 root -11 0 2876 752 656 R 100 0.0 0:18.79 6 yes
25665 root -11 0 2876 804 716 R 100 0.0 0:18.54 7 yes
25666 root -11 0 2876 748 656 R 100 0.0 0:18.31 8 yes
25667 root -11 0 2876 812 720 R 100 0.0 0:18.08 29 yes <--- core9
25669 root -11 0 2876 744 656 R 100 0.0 0:17.62 9 yes <--- core9
25670 root -11 0 2876 808 720 R 100 0.0 0:17.37 2 yes
25671 root -11 0 2876 748 656 R 100 0.0 0:17.15 23 yes <--- core3
25672 root -11 0 2876 804 712 R 100 0.0 0:16.94 4 yes
25674 root -11 0 2876 748 656 R 100 0.0 0:16.35 3 yes <--- core3
25673 root -11 0 2876 812 716 R 100 0.0 0:16.68 1 yes
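To double-check the pairings flagged above without staring at top, the CPU each busy task last ran on can be read from /proc (a sketch; field layout per proc(5), assuming the busy tasks are the `yes` instances started above):

```shell
# For every running 'yes', print the CPU it last ran on (the
# 'processor' field of /proc/<pid>/stat) plus that CPU's SMT siblings,
# so two tasks sharing a physical core stand out.
for pid in $(pgrep -x yes); do
    # Split after the ')' that closes the comm field so the field count
    # is not thrown off by spaces in a task name; 'processor' (overall
    # field 39) is then the 37th remaining field.
    cpu=$(cut -d')' -f2- "/proc/$pid/stat" | awk '{print $37}')
    echo "pid=$pid cpu=$cpu siblings=$(cat /sys/devices/system/cpu/cpu$cpu/topology/thread_siblings_list)"
done
```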
This is a server with 20 physical cores, so 8 physical cores remain completely idle, yet threads are still scheduled onto the same physical core. This is reproducible and persistent. It doesn't seem to happen for non-SCHED_FIFO threads, and it only started after migrating past kernel 4.19.
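As a stop-gap I can pin the synthetic load one task per physical core with taskset, which at least shows whether explicit affinity avoids the pile-up (a sketch; it assumes, as the numbering above suggests, that logical CPUs 0-19 are the first SMT thread of the 20 physical cores, with 20-39 being their siblings):

```shell
# Start 12 RT busy loops, each pinned to a distinct physical core.
# Bounded to 30 s with timeout so the loops clean themselves up;
# chrt -f needs root (or CAP_SYS_NICE).
for core in $(seq 0 11); do
    timeout 30 chrt -f 10 taskset -c "$core" yes > /dev/null &
done
```

Obviously pinning by hand is a workaround, not an answer; I'd still like to know why the scheduler doesn't spread these tasks itself.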
Is this correct behaviour for SCHED_FIFO threads? Is there any flag or config option that can change this scheduler behaviour?