Why does this delay-loop start to run faster after several iterations with no sleep?
Consider:

#include <time.h>
#include <unistd.h>
#include <iostream>
using namespace std;

const int times = 1000;
const int N = 100000;

void run() {
  for (int j = 0; j < N; j++) {
  }
}

int main() {
  clock_t main_start = clock();
  for (int i = 0; i < times; i++) {
    clock_t start = clock();
    run();
    cout << "cost: " << (clock() - start) / 1000.0 << " ms." << endl;
    //usleep(1000);
  }
  cout << "total cost: " << (clock() - main_start) / 1000.0 << " ms." << endl;
}

In the example code above, the run function costs about 0.4 ms in each of the first 26 iterations of the timing loop, but then the cost drops to about 0.2 ms.

When the usleep is uncommented, the delay loop takes about 0.4 ms on every iteration and never speeds up. Why?

The code is compiled with g++ -O0 (no optimization), so the delay loop isn't optimized away. It's run on Intel(R) Core(TM) i3-3220 CPU @ 3.30 GHz, with 3.13.0-32-generic Ubuntu 14.04.1 LTS (Trusty Tahr).

Tambourin answered 11/7, 2016 at 4:6 Comment(7)
You should probably check the result of usleep(), as it may be interrupted, or it may do nothing because your parameter isn't valid. This would make any timings unreliable. – Upturn
@John3136: The usleep is outside the timing window. He's timing a busy-loop either back to back or separated by 1 ms sleeps. – Perk
For benchmarking purposes, you should compile with gcc -O2 or (since your code is C++) g++ -O2 at least. – Arithmomancy
@BasileStarynkevitch: Problem with that in the present example: at any optimization level, the compiler would dead-strip the whole code. (Which leads to the question: why would I want to benchmark an empty, invariant loop in the first place?) – Delanty
Then update the code to compute something meaningful, as a function of some unknown input (e.g. strtol(argv[1])...). – Arithmomancy
If you sleep for 1000 microseconds, I'd expect the loop to take at least 1 millisecond. How are you measuring 0.4 ms? – Adams
@AdrianMcCarthy: the usleep is outside the timing window. – Perk

After 26 iterations (roughly 10 ms at ~0.4 ms each), Linux ramps the CPU up to the maximum clock speed, since your process has used its full time slice a couple of times in a row.

If you checked with performance counters instead of wall-clock time, you'd see that the core clock cycles per delay-loop stayed constant, confirming that it's just an effect of DVFS (which all modern CPUs use to run at a more energy-efficient frequency and voltage most of the time).

If you tested on a Skylake with kernel support for the new power-management mode (hardware P-states, where the hardware takes full control of the clock speed), ramp-up would happen much faster.

If you leave it running for a while on an Intel CPU with Turbo, you'll probably see the time per iteration increase again slightly once thermal limits require the clock speed to reduce back down to the maximum sustained frequency. (See Why can't my CPU maintain peak performance in HPC for more about Turbo letting the CPU run faster than it can sustain for high-power workloads.)


Introducing a usleep prevents Linux's CPU frequency governor from ramping up the clock speed, because the process isn't generating 100% load even at minimum frequency. (I.e. the kernel's heuristic decides that the CPU is running fast enough for the workload that's running on it.)



Comments on other theories:

re: David's theory that a potential context switch from usleep could pollute caches: That's not a bad idea in general, but it doesn't help explain this code.

Cache / TLB pollution isn't important at all for this experiment. There's basically nothing inside the timing window that touches memory other than the end of the stack. Most of the time is spent in a tiny loop (1 line of instruction cache) that only touches one int of stack memory. Any potential cache pollution during usleep is a tiny fraction of the time for this code (real code will be different)!

In more detail for x86:

The call to clock() itself might cache-miss, but a code-fetch cache miss delays the starting-time measurement, rather than being part of what's measured. The second call to clock() will almost never be delayed, because it should still be hot in cache.

The run function may be in a different cache line from main (since gcc marks main as "cold", so it gets optimized less and placed with other cold functions/data). We can expect one or two instruction-cache misses. They're probably still in the same 4k page, though, so main will have triggered the potential TLB miss before entering the timed region of the program.

gcc -O0 compiles the OP's code with the loop counter kept in memory on the stack (see the Godbolt compiler explorer).
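Roughly, the inner loop looks like this at -O0 (a sketch of typical x86-64 gcc output in Intel syntax, not the OP's exact listing):

```asm
run():
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], 0        # int j = 0, kept on the stack
        jmp     .L2
.L3:
        add     DWORD PTR [rbp-4], 1        # j++ : store + reload every iteration
.L2:
        cmp     DWORD PTR [rbp-4], 99999    # j < N, with N == 100000
        jle     .L3
        pop     rbp
        ret
```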

The empty loop keeps the loop counter in stack memory, so on a typical Intel x86 CPU the loop runs at one iteration per ~6 cycles on the OP's IvyBridge CPU, thanks to the store-forwarding latency that's part of add with a memory destination (read-modify-write). 100k iterations * 6 cycles/iteration is 600k cycles, which dominates the contribution of at most a couple cache misses (~200 cycles each for code-fetch misses which prevent further instructions from issuing until they're resolved).

Out-of-order execution and store-forwarding should mostly hide the potential cache miss on accessing the stack (as part of the call instruction).

Even if the loop-counter was kept in a register, 100k cycles is a lot.

Perk answered 11/7, 2016 at 5:57 Comment(5)
I increased the value of N by 100x and used the cpufreq-info command, and found the CPU still works at the minimum frequency while the code is running. – Tambourin
@phyxnj: with usleep un-commented? It ramps up for me with N=10000000. (I use grep MHz /proc/cpuinfo since I never got around to installing cpufreq-utils on this machine.) Actually I just found that cpupower frequency-info shows what cpufreq-info did, for one core. – Perk
@phyxnj: are you sure you're looking at all cores, and not just one core? cpupower seems to default to just core 0. – Perk
grep MHz /proc/cpuinfo shows the CPU frequency increase indeed. cpufreq-info may be monitoring a random core of the CPU. I think you are right; maybe this is the cause of the problem. – Tambourin
@phyxnj: It's not random, and the core number is printed in the output, e.g. thinkwiki.org/wiki/How_to_use_cpufrequtils. It almost certainly defaults to core 0 only. The only thing that's unpredictable is which core your process will run on. – Perk

A call to usleep may or may not result in a context switch. If it does, it will take longer than if it doesn't.

Circassian answered 11/7, 2016 at 4:12 Comment(16)
usleep relinquishes the CPU voluntarily, so there should be a context switch for sure (even if the system is idle), shouldn't there? – Veronica
@rakib Not if there's nothing to switch contexts to, or if the time interval is too short. When you're talking about less than 10 ms or so, the OS may decide not to do a context switch. – Circassian
@rakib: There's a switch to kernel mode and back for sure. There might not be a switch to another process before resuming the process that called usleep, so pollution of caches / TLBs / branch predictors may be minimal. – Perk
@David: Linux handles time at nanosecond precision, so milliseconds shouldn't be the issue. But when a task relinquishes the CPU, at wake-up time the task goes back into the run queue, and there's no guarantee it will run immediately. Perhaps that might be the reason. – Veronica
@rakib It has nothing to do with precision. It has to do with a binary decision whether or not to perform a context switch. It's a yes-or-no decision. – Circassian
@Peter: Well, if there's no runnable task, the CPU will remain idle. Then what does the process that called usleep do when it doesn't switch? Spin on the CPU? And if there's no other process to run, is there any chance of cache / TLB invalidation? – Veronica
Cache / TLB pollution isn't important at all for this experiment. There's nothing inside the timing window other than the call to clock() itself that touches memory other than the stack, and clock() will be hot on the second call. Besides clock, most of the cycles (100k * ~6 cycles per iteration (store-forwarding latency) for a loop that keeps its counter in a local on the stack, on a typical x86) are spent in a busy-loop that only touches stack memory. I think gcc -O3 would optimize away the loop entirely, but 100k * 1 cycle is still a lot compared to one miss in clock. – Perk
@rakib It's up to the OS, based on some complex decision-making involving things like power management and what hardware timers are available on the platform. It may spin. It may use a hardware timer. Above 10 ms, pretty much all modern Linux platforms use a hardware sleep. Below that, it's hardware-specific. It may depend on what other tasks are ready to run, how much of the timeslice is left, dynamic scheduler priorities (what this task has done in the recent past), and so on. – Circassian
@rakib: The process is asleep. I think you mean "what does the CPU that was running the process that called usleep do?" If the sleep is short enough, yes, it may just spin in a calibrated delay loop. Otherwise the CPU will set a timer to wake itself up at the right time, then run a HLT or something to enter a power-saving sleep if there's nothing else to run and no kernel housekeeping tasks to do. IDK Linux internals well enough to know how much work could be done, and how much cache pollution could occur. – Perk
I took a quick look into the relevant Linux code, but didn't find any such thing other than inserting a timer event for a future wake-up and calling schedule(). That's what I was trying to say. I know there are hardware- and software-based timers, and also kernel config issues. But the point of any sleep is to relinquish the CPU, and therefore a context switch. Just to back my point, I wrote a program which just calls usleep(10) and ran perf stat on it; the output showed 1 context switch during its run time (just an indication). – Veronica
@rakib Then schedule figures out how long until the next thing it has to do, and then decides how to wait: possibly scheduling something else, possibly using a hardware timer, possibly not. – Circassian
@rakib: Depending on terminology, a "context switch" can mean that another user-space thread/process runs on the current CPU, or just a transition from user mode to kernel mode (and back to the same process), like for a trivial system call such as getpid or gettimeofday. (A better example would be a syscall that doesn't have a user-space implementation exported in the VDSO, but making those syscalls is still possible with asm.) The transition to kernel mode and back always happens, even if the kernel just did an iret or sysret right away (on x86, for example). – Perk
@rakib: If nothing significant runs on the CPU before it returns to the caller of usleep, some people would say there was no context switch, even if the hardware went to sleep (with a hlt instruction) for a short time. There's definitely minimal cache / TLB pollution in that case, and IIRC no TLB invalidation. (I forget exactly how the page tables for kernel mode work, but I don't think the whole TLB has to be blown away on every system call.) – Perk
@Peter Typically a context switch indicates a change in the current execution context. That execution context might be running user-space or kernel-space code. To clarify whether a context switch happens at the time of calling usleep or not: I was trying to point out that yes, it happens, rather than may or may not happen. – Veronica
@rakib: If your definition of context switch includes a transition from user mode to kernel mode, then obviously every system call including usleep causes two (one on entry and one on return). More may happen if another user-space task gets the CPU before returning. IDK what you're trying to say with "may not happen". – Perk
@Peter No, I don't mean that; if for some reason it appears that way to you, then I can't help here. By saying "may not happen" I meant what David posted in his answer. – Veronica

© 2022 - 2024 — McMap. All rights reserved.