Why one non-voluntary context switch per second?

Asked 26/12, 2012 at 6:58 Answered 12/6, 2023 at 22:31

Solved linux-kernel operating-system scheduling kernel

The OS is RHEL 6 (2.6.32). I have isolated a core and am running a compute intensive thread on it. /proc/{thread-id}/status shows one non-voluntary context switch every second.
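For anyone reproducing the measurement, here is a minimal sketch of reading the counters from the status file. It assumes the `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches` fields that Linux exposes in `/proc/<id>/status`. The helper names (`parse_ctxt_switches`, `read_ctxt_switches`, `switches_per_second`) are made up for illustration.

```python
def parse_ctxt_switches(status_text):
    """Extract the two context-switch counters from /proc status text."""
    wanted = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
    counters = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            counters[key] = int(value.strip())
    return counters


def read_ctxt_switches(thread_id):
    """Read the counters for one thread (hypothetical helper, Linux only)."""
    with open(f"/proc/{thread_id}/status") as f:
        return parse_ctxt_switches(f.read())


def switches_per_second(before, after, elapsed_s):
    """Non-voluntary switch rate between two samples taken elapsed_s apart."""
    key = "nonvoluntary_ctxt_switches"
    return (after[key] - before[key]) / elapsed_s
```

Sampling `read_ctxt_switches` twice, some seconds apart, and passing both results to `switches_per_second` should give a rate close to 1.0 for the behaviour described here.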

The thread in question is a SCHED_NORMAL thread and I don't want to change this.

How can I reduce this number of non-voluntary context switches? Does this depend on any scheduling parameters in /proc/sys/kernel?

EDIT: Several responses suggest alternative approaches. Before going that route, I first want to understand why I am getting exactly one non-voluntary context switch per second even over hours of run. For example, is this caused by CFS? If so, which parameters and how?

EDIT2: Further clarification - first question I would like an answer to is the following: Why am I getting one non-voluntary context switch per second instead of, say, one switch every half or two seconds?

Lytle answered 26/12, 2012 at 6:58 Comment(6)

Why would you care? Even 100 context switches per second is noise on a modern system. – Stonewall 26/12, 2012 at 7:6

It's a financial app where latency is at a premium and every context switch may mean one or more lost opportunities. I would like to understand what system tuning parameters determine the non-voluntary context switch rate of compute-intensive threads on isolated cores. – Lytle 26/12, 2012 at 7:12

It's most likely blocking on either a lock, normal disk I/O, or a page fault. – Stonewall 26/12, 2012 at 7:32

None of those. The rate is precisely one non-voluntary context switch per second over hours of run. I am almost certain that CFS is doing this - but based on what scheduling parameters? – Lytle 26/12, 2012 at 7:37

Like @DavidSchwartz says, if this is an issue, you need a dedicated box and a real-time OS, not a general-purpose desktop. Context-switches are, nearly always, a gained opportunity because of the good I/O performance achieved. 'one non-voluntary context switch every second' - what? Like David says, who cares? Optimize something that matters.... – Magniloquent 27/12, 2012 at 1:47

@Martin: Do you understand why this one context switch is happening with regularity? The box is dedicated. No point jumping to an alternate solution without understanding what is causing the current issue. In case it is not clear yet - I want to understand why I am getting a context switch every second instead of, say, one every half or two seconds. Surely some combination of machine and/or OS configuration - what? – Lytle 27/12, 2012 at 3:32

This is a guess, but an educated one. Since you use an isolated CPU, the scheduler does not schedule any task on it except your own, with one exception: the vmstat code in the kernel has a timer that queues a single work queue item on each CPU once per second to calculate memory usage statistics. That work item is what you are seeing get scheduled each second.

The work queue code is smart enough to not schedule the work queue kernel thread if the core is 100% idle but not if it is running a single task.

You can verify this using ftrace. Look at which task the sched_switch tracer shows you switching to once every second or so (the interval is rounded to the nearest jiffy, and the timer does not count while the CPU is idle, so the timing may be skewed). If it is the events/CPU_NUMBER task (or keventd on older kernels), then the cause is almost certainly the vmstat_update function setting its timer to queue a work queue item every second, which the events kernel thread then runs.
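As a rough sketch of that check, the snippet below tallies which tasks a given CPU switches to, based on saved trace output. It assumes the `sched_switch` event line format printed by more recent kernels (`... ==> next_comm=NAME next_pid=N ...`). Older kernels such as 2.6.32 may print these lines differently, so the regular expression is an assumption to adapt. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape (recent kernels):
#   task-PID [CPU] flags TIMESTAMP: sched_switch: ... ==> next_comm=X next_pid=N ...
SWITCH_RE = re.compile(
    r"\[(?P<cpu>\d+)\].*?\s(?P<ts>\d+\.\d+): sched_switch: "
    r".*?==> next_comm=(?P<next_comm>.+?) next_pid=(?P<next_pid>\d+)"
)


def switches_on_cpu(trace_text, cpu):
    """Return (timestamp, next_comm) for every sched_switch seen on one CPU."""
    events = []
    for line in trace_text.splitlines():
        m = SWITCH_RE.search(line)
        if m and int(m.group("cpu")) == cpu:
            events.append((float(m.group("ts")), m.group("next_comm")))
    return events


def count_targets(events, ignore=()):
    """Count the tasks switched to, skipping names listed in ignore."""
    return Counter(comm for _, comm in events if comm not in ignore)
```

If the only entry left after ignoring your own task name is `events/N` (or a similar per-CPU kernel worker), with timestamps about one second apart, that is consistent with the vmstat explanation.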

Note that the interval at which vmstat sets its timer is configurable: you can set it to another value via the vm.stat_interval sysctl knob. Increasing this value gives you a lower rate of such interruptions at the cost of less accurate memory usage statistics.
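A small sketch of inspecting the knob from a script, assuming it is exposed at `/proc/sys/vm/stat_interval` and holds a whole number of seconds. The helper names are hypothetical, and `expected_wakeups` is just arithmetic on the interval, not a measurement.

```python
from pathlib import Path

STAT_INTERVAL_PATH = "/proc/sys/vm/stat_interval"


def read_stat_interval(path=STAT_INTERVAL_PATH):
    """Return the current vmstat update interval in seconds."""
    return int(Path(path).read_text().strip())


def expected_wakeups(duration_s, interval_s):
    """Roughly how many vmstat work items fire per CPU over duration_s."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return duration_s // interval_s
```

Changing the value needs root, for example with `sysctl -w vm.stat_interval=10`. With an interval of 10, an hour-long run would see roughly 360 of these wakeups per CPU instead of 3600.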

I maintain a wiki with all the sources of interruptions to isolated CPU workloads here. I also have a patch in the works to stop vmstat from scheduling the work queue item if nothing changed between one vmstat work queue run and the next, as would happen if the single task on the CPU does no dynamic memory allocation. Not sure it will benefit you, though; it depends on your workload.

Garner answered 2/1, 2013 at 19:27 Comment(1)

vm.stat_interval - awesome. I have nonvoluntary context switches at one every 10 seconds; this is just the knob I needed to understand. – Intent 12/6, 2023 at 21:4

If one interrupt per second on your dedicated CPU is still too much, then you really need to not go through the normal scheduler at all. May I suggest the real-time and isochronous priority levels, which can keep your process scheduled more reliably than the usual pre-emptive mechanisms?

Marita answered 26/12, 2012 at 9:39 Comment(0)

Here it is 2023 and I am drawn to this question by Google. Thanks to @gby, I discover a little knob to turn called vm.stat_interval, also found in /proc/sys/vm/stat_interval. His answer is the best, but I thought I would amend it.

More than ten years after the original post, if anybody doesn't know, you can tune your OS and isolate your CPUs and still get interrupts. For some time we have had the tuned project (https://github.com/redhat-performance/tuned), which lets you do that tuning and isolation. But the OS still needs to do its memory housekeeping, and 1 Hz is the default. This is controlled by that stat_interval setting via sysctl.

There's a good article at https://www.suse.com/c/cpu-isolation-introduction-part-1/ which discusses CPU isolation, including eliminating interrupts. As the article says, "some more specialized needs may clearly stumble on the noise within their way. This is the case for processing that require the entire CPU time and can’t suffer any cycle theft." One example cited is DPDK which is utilized by the Solarflare network card drivers and Onload.

In any event, on otherwise isolated CPUs, using the cpu-partitioning tuned profile found in Red Hat systems, the vm.stat_interval is set to 10, and one's applications will be interrupted once every 10 seconds. This is the source of periodic nonvoluntary context switches on isolated CPUs.

Intent answered 12/6, 2023 at 22:31 Comment(0)
