Why are SCHED_FIFO threads assigned to the same physical CPU even though idle CPUs are available?

While debugging a performance issue in an app I'm working on, I noticed weird behaviour in the kernel scheduler: busy SCHED_FIFO tasks tend to be scheduled on logical cores belonging to the same physical CPU even though there are idle physical CPUs in the system.

 8624 root     -81   0 97.0g  49g 326m R  100 52.7  48:13.06 26 Worker0 <-- CPU 6 and 26 
 8629 root     -81   0 97.0g  49g 326m R  100 52.7  44:56.26  6 Worker5 <-- the same physical core
 8625 root     -81   0 97.0g  49g 326m R   82 52.7  58:20.65 23 Worker1
 8627 root     -81   0 97.0g  49g 326m R   67 52.7  55:28.86 27 Worker3
 8626 root     -81   0 97.0g  49g 326m R   67 52.7  46:04.55 32 Worker2
 8628 root     -81   0 97.0g  49g 326m R   59 52.7  44:23.11  5 Worker4

Initially the threads shuffle between cores, but at some point the most CPU-intensive threads end up locked on the same physical core and don't seem to move from there. There is no affinity set for the Worker threads.
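
For reference, here is one way to confirm that two logical CPUs are SMT siblings of the same physical core (this uses the standard sysfs topology interface and lscpu; the CPU numbers match my box and will differ elsewhere):

# List logical CPU -> physical core mapping
lscpu -e=CPU,CORE,SOCKET

# Or query the sibling list for one logical CPU; here this prints
# "6,26", confirming CPUs 6 and 26 share a physical core
cat /sys/devices/system/cpu/cpu6/topology/thread_siblings_list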

I tried to reproduce it with a synthetic load by running 12 instances of:

chrt -f 10 yes > /dev/null &

And here is what I got:

25668 root     -11   0  2876  752  656 R  100  0.0   0:17.86 20 yes
25663 root     -11   0  2876  744  656 R  100  0.0   0:19.10 25 yes
25664 root     -11   0  2876  752  656 R  100  0.0   0:18.79  6 yes
25665 root     -11   0  2876  804  716 R  100  0.0   0:18.54  7 yes
25666 root     -11   0  2876  748  656 R  100  0.0   0:18.31  8 yes
25667 root     -11   0  2876  812  720 R  100  0.0   0:18.08 29 yes <--- core9
25669 root     -11   0  2876  744  656 R  100  0.0   0:17.62  9 yes <--- core9
25670 root     -11   0  2876  808  720 R  100  0.0   0:17.37  2 yes 
25671 root     -11   0  2876  748  656 R  100  0.0   0:17.15 23 yes <--- core3
25672 root     -11   0  2876  804  712 R  100  0.0   0:16.94  4 yes
25674 root     -11   0  2876  748  656 R  100  0.0   0:16.35  3 yes <--- core3
25673 root     -11   0  2876  812  716 R  100  0.0   0:16.68  1 yes

This is a server with 20 physical cores, so there are 8 physical cores left idle, yet threads are still scheduled onto the same physical cores. This is reproducible and persistent. It doesn't seem to happen for non-SCHED_FIFO threads, and it started after migrating past kernel 4.19.

Is this the correct behaviour for SCHED_FIFO threads? Is there a flag or config option that can change it?

Selfinterest answered 7/4, 2022 at 14:40

If I'm understanding correctly, you're using SCHED_FIFO with hyperthreading ("HT") enabled, which gives you multiple logical processors per physical core. My understanding is that HT-awareness in the Linux kernel lives mainly in the load balancing and scheduler domains within CFS (the default scheduler these days). See https://mcmap.net/q/533680/-why-does-linux-39-s-scheduler-put-two-threads-onto-the-same-physical-core-on-processors-with-hyperthreading for more info.

Using SCHED_FIFO or SCHED_RR would then essentially bypass HT handling, since RT scheduling doesn't really go through CFS.
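
If you want to see those scheduler domains for yourself, the topology levels (SMT, MC, ...) are exposed via debugfs on recent kernels; the exact path has moved around between versions (older kernels expose it under /proc/sys/kernel/sched_domain/ instead), and debugfs must be mounted:

# Print the name of each scheduling-domain level for CPU 0
grep . /sys/kernel/debug/sched/domains/cpu0/domain*/name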

My approach to dealing with this in the past has been to disable hyperthreading. For cases where you actually need real-time behavior, this is usually the right latency/performance tradeoff to make anyway (see https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application#Hyper_threading). Whether this is appropriate really depends on what problem you're trying to solve.
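
For example, using the standard sysfs SMT control (available since kernel 4.19; requires root, and booting with "nosmt" on the kernel command line achieves the same thing at boot):

# Turn SMT off at runtime...
echo off > /sys/devices/system/cpu/smt/control
# ...and confirm
cat /sys/devices/system/cpu/smt/control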

Aside: I suspect that if you actually need SCHED_FIFO behavior then disabling HT is what you'll want to do, but it's also common for people to think they need SCHED_FIFO when it's the wrong tool for the job. My suspicion is that there may be a better option than SCHED_FIFO since you're describing a conventional server rather than an embedded system, but that's an over-generalizing guess. Hard to say without more specifics about the issue.
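
As a sketch of what I mean (the ./worker binary name here is hypothetical; chrt and taskset are standard util-linux tools):

# Option 1: keep normal scheduling but hint that the task is a
# CPU-bound batch job (SCHED_BATCH requires priority 0)
chrt -b 0 ./worker &

# Option 2: pin each worker to its own physical core so two of them
# can't end up co-scheduled on SMT siblings
taskset -c 0 ./worker &
taskset -c 1 ./worker &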

Cordwain answered 13/4, 2022 at 21:06
Yes, I have HyperThreading enabled. While it is possible that SCHED_FIFO completely bypasses load balancing, it is somewhat surprising, because this behaviour changed starting with kernel 4.19. So either SCHED_FIFO balancing was a side effect of an earlier bug, or it was removed on purpose (for which I can't find any indication). – Selfinterest
It is possible that we are overusing SCHED_FIFO. This is not actually an embedded or RT system, but an HPC system focused on throughput. From past experience, though, we have seen the Completely Fair Scheduler being, well, "fair" and pre-empting our high-priority worker threads to give CPU time to other threads, thrashing caches and causing all sorts of performance issues. Although with recent improvements in scheduling this might no longer be an issue. – Selfinterest
I would still like some kind of indication that the new behaviour starting from kernel 4.19 is "as designed", or whether there is a configuration option, either at runtime or in the kernel config, that can change it. It seems like quite a big behaviour change that went pretty silently. – Selfinterest

The problem was caused by this particular change: https://lkml.iu.edu/hypermail/linux/kernel/1806.0/04887.html

The per-CPU-core watchdog threads were removed:

watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1); /* highest possible RT priority */

Before, these threads ran periodically every 4 seconds, and because they had the absolute highest priority, they caused periodic rescheduling. With them gone, there is nothing that can pre-empt a busy SCHED_FIFO thread and migrate it to a "better" core. So the old balancing was just a side effect of the watchdog implementation; in general there is no mechanism in the kernel that rebalances runaway RT threads.
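
This means that if you stay on SCHED_FIFO, you have to place the threads yourself. A sketch, using the Worker TIDs from the top output above (the target CPU numbers are illustrative, assuming the usual numbering where CPUs 0-19 are distinct physical cores and 20-39 their siblings; verify with lscpu -e first):

# Pin the two workers that collided onto distinct physical cores
taskset -pc 6 8624    # pin Worker0 to CPU 6
taskset -pc 7 8629    # pin Worker5 to CPU 7, a different physical core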

Selfinterest answered 28/4, 2022 at 10:12
