Windows Sleep(0) vs The PAUSE instruction
Let me quote from the Intel 64 and IA-32 Architectures Optimization Reference Manual.
In multi-threading implementations, a popular construct for thread synchronization and for yielding scheduling
quanta to another thread waiting to carry out its task is to sit in a loop issuing SLEEP(0).
These are typically called “sleep loops” (see Example #1). It should be noted that a SwitchToThread call
can also be used. The “sleep loop” is common in locking algorithms and thread pools, as the threads are
waiting on work.
This construct of sitting in a tight loop and calling the Sleep() service with a parameter of 0 is actually a
polling loop with side effects:
- Each call to Sleep() incurs the expensive cost of a context switch, which can be 10,000+ cycles.
- It also incurs the cost of ring 3 to ring 0 transitions, which can be 1,000+ cycles.
- When no other thread is waiting to take possession of control, this sleep loop appears to the OS
as a highly active task demanding CPU resources, preventing the OS from putting the CPU into a low-power
state.
Example #1. Unoptimized Sleep Loop
while (!acquire_lock())
{
    Sleep(0);
}
do_work();
release_lock();
Example #2. Power Consumption Friendly Sleep Loop Using PAUSE
ATTEMPT_AGAIN:
if (!acquire_lock())
{
    /* Spin on pause max_spin_count times before backing off to sleep */
    for (int j = 0; j < max_spin_count; ++j)
    {
        /* Intrinsic for the PAUSE instruction */
        _mm_pause();
        if (read_volatile_lock())
        {
            if (acquire_lock()) goto PROTECTED_CODE;
        }
    }
    /* Pause loop didn't work, sleep now */
    Sleep(0);
    goto ATTEMPT_AGAIN;
}
PROTECTED_CODE:
do_work();
release_lock();
Example #2 shows the technique of using the PAUSE instruction to make the sleep loop power-friendly.
By slowing down the “spin-wait” with the PAUSE instruction, the multi-threading software gains:
- Performance, by making it easier for the waiting tasks to acquire resources from a busy wait.
- Power savings, by using fewer parts of the pipeline while spinning.
- Elimination of the great majority of unnecessarily executed instructions caused by the overhead of a
Sleep(0) call.
In one case study, this technique achieved a 4.3x performance gain, which translated into 21% power savings at the processor level and 13% at the platform level.
Pause Latency in Skylake Microarchitecture
The PAUSE instruction is typically used with software threads executing on two logical processors located in the same processor core, waiting for a lock to be released. Such short wait loops tend to last between tens and a few hundreds of cycles, so performance-wise it is more beneficial to wait while occupying the CPU than yielding to the OS. When the wait loop is expected to last for thousands of cycles or more, it is preferable to yield to the operating system by calling one of the OS synchronization API functions, such as WaitForSingleObject on Windows OS.
The PAUSE instruction is intended to:
- Temporarily provide the sibling logical processor (ready to make forward progress exiting the spin loop) with competitively shared hardware resources. The competitively-shared microarchitectural resources that the sibling logical processor can utilize in the Skylake microarchitecture are: (1) More front end slots in the Decode ICache, LSD and IDQ; (2) More execution slots in the RS.
- Save power consumed by the processor core compared to executing equivalent spin loop instruction sequence in the following configurations: (1) One logical processor is inactive (e.g. entering a C-state); (2) Both logical processors in the same core execute the PAUSE instruction; (3) HT is disabled (e.g. using BIOS options).
The latency of the PAUSE instruction in prior microarchitecture generations is about 10 cycles, whereas on the Skylake microarchitecture it has been extended to as many as 140 cycles.
The increased latency (allowing more effective utilization of competitively-shared microarchitectural resources to the logical processor ready to make forward progress) has a small positive performance impact of 1-2% on highly threaded applications. It is expected to have negligible impact on less threaded applications if forward progress is not blocked on executing a fixed number of looped PAUSE instructions.
There is also a small power benefit on 2-core and 4-core systems. Since the PAUSE latency has increased significantly, workloads that are sensitive to it will suffer some performance loss.
You can find more information on this issue in the "Intel 64 and IA-32 Architectures Optimization Reference Manual" and "Intel 64 and IA-32 Architectures Software Developer’s Manual", along with the code samples.
My Opinion
It is better to make the program logic flow in such a way that neither Sleep(0) nor the PAUSE instruction is ever needed. In other words, avoid the “spin-wait” loops altogether. Instead, use high-level synchronization functions such as WaitForMultipleObjects(), SetEvent(), and so on. If you analyze the tools at your disposal in terms of performance, efficiency, and power saving, the higher-level functions are the best choice. Although they also incur expensive context switches and ring 3 to ring 0 transitions, these costs are infrequent and more than reasonable compared to what you would have spent in total on all the “spin-wait” PAUSE cycles combined, or on the cycles burned by Sleep(0).
On a processor supporting Hyper-Threading, “spin-wait” loops can consume a significant portion of the execution bandwidth of the processor. One logical processor executing a spin-wait loop can severely impact the performance of the other logical processor. That is why disabling Hyper-Threading may sometimes improve performance, as some people have pointed out.
Constantly polling for device, file, or state changes in the program's workflow causes the computer to consume more power, puts stress on memory and the bus, and incurs unnecessary page faults (use Task Manager in Windows to see which applications produce the most page faults while idle, waiting for user input in the background - these are the most inefficient applications, since they are doing the polling mentioned above). Minimize polling (including spin loops) whenever possible and use an event-driven approach and/or a framework if one is available - this is the best practice that I highly recommend. Your application should literally sleep all the time, waiting for multiple events set up in advance.
A good example of an event-driven application is Nginx, initially written for Unix-like operating systems. Since operating systems provide various functions and methods to notify your application, use these notifications instead of polling for device state changes. Just let your program sleep until a notification or user input arrives. Such a technique eliminates the overhead of polling the status of the data source, because the code gets notifications asynchronously when the status changes.
Comments:
- "Sleep(0) doesn't do what you think it does. 'A value of zero causes the thread to relinquish the remainder of its time slice to any other thread that is ready to run.'" – Nidify
- "Use std::atomic<bool> t. Any decent compiler will hoist the load of t out of the loop unless it's atomic." – Unstrung