_mm_pause usage in gcc on Intel

Asked 6/5, 2016 at 3:24 Answered 16/5, 2019 at 2:31

I was reading this page: https://software.intel.com/en-us/articles/benefitting-power-and-performance-sleep-loops , and I cannot understand the following:

the pause instruction gives a hint to the processor that the calling thread is in a "spin-wait" loop. In addition, the pause instruction is a no-op when used on x86 architectures that do not support Intel SSE2, meaning it will still execute without doing anything or raising a fault. While this means older x86 architectures that don’t support Intel SSE2 won’t see the benefits of the pause, it also means that you can keep one straightforward code path that works across the board.

I would like to know: lscpu on Linux shows CPU information, but I have no idea whether my CPU supports SSE2. How can I check this myself?

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz
Stepping:              2
CPU MHz:               3599.882
BogoMIPS:              6804.22
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23

Also, I currently use _mm_pause or __asm volatile ("pause" ::: "memory"); and with it the idle time on that core drops to zero, but the following code using nanosleep is too slow for me:

while (1) {
    nanosleep(&req, NULL);  /* req is set up as shown in the edit below */
    do_something();
}

I observe that nanosleep delays for 60 microseconds on my box. Is there any solution that is faster than nanosleep but does not exhaust the CPU core the way _mm_pause() or __asm volatile ("pause" ::: "memory") does?

Edit :

struct timespec req={0};
req.tv_sec=0;
req.tv_nsec=100 ;
nanosleep(&req,NULL) ;

This nanosleep takes 60 microseconds on my box, which has the CPU shown above, and I have no idea why.

Avalanche answered 6/5, 2016 at 3:24 Comment(1)

egrep -o '(sse|avx)[0-9]*' /proc/cpuinfo | sort -u. (sort -u because there's a line for each core, and you don't want that. grep -o prints only the matching text, not the whole line that matched). – Guss 6/5, 2016 at 20:1

To check whether your platform supports SSE2:

gcc -march=native -dM -E - </dev/null | grep SSE

But you don't need to check for support: The pause instruction safely decodes as a NOP on CPUs that don't recognize it as pause. (The encoding is basically rep nop). It's unlikely that a nop instead of a 5 or 100 cycle pause in the pipeline could be a correctness problem for your code.
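If you would rather check from inside the program than from the shell, here is a minimal sketch using the GCC/Clang builtin __builtin_cpu_supports. The helper names cpu_has_sse2 and report_sse2 are made up for illustration, and the -1 "unknown" result for non-x86 targets is my own convention.

```c
#include <stdio.h>

/* Returns 1 if the running CPU reports SSE2, 0 if it does not, and -1 when
 * the check is unavailable (non-x86 target or no GCC-style builtins). */
int cpu_has_sse2(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    /* Only strictly needed when running before constructors; harmless here. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? 1 : 0;
#else
    return -1;
#endif
}

/* Prints a one-line report (hypothetical helper for illustration). */
void report_sse2(void)
{
    int r = cpu_has_sse2();
    printf("SSE2: %s\n", r == 1 ? "yes" : r == 0 ? "no" : "unknown");
}
```

Calling report_sse2() from main is enough to try it. On any x86-64 CPU the answer should be yes, since SSE2 is part of the baseline x86-64 instruction set.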

_mm_pause won't release the CPU to the scheduler. As you mentioned, it is designed for another purpose: it is a hint to the microarchitecture that the thread is in a spin-wait loop.
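To illustrate the difference, here is a rough sketch of a wait loop that spins with pause for a bounded number of iterations and only then starts giving the CPU back with sched_yield. The names wait_for_flag, cpu_relax and spin_limit are my own, and the no-op fallback on non-x86 targets is an assumption made so that the sketch compiles everywhere.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)   /* assumption: no pause hint on other targets */
#endif

/* Waits until *flag becomes non-zero. Spins with pause for up to spin_limit
 * iterations, then yields to the scheduler on every further iteration.
 * Returns the number of sched_yield() calls that were made. */
long wait_for_flag(atomic_int *flag, long spin_limit)
{
    long spins = 0, yields = 0;

    assert(flag != NULL);
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        if (spins < spin_limit) {
            cpu_relax();        /* stays on the core, never enters the kernel */
            spins++;
        } else {
            sched_yield();      /* lets other runnable threads use the core */
            yields++;
        }
    }
    return yields;
}
```

Another thread would set the flag with atomic_store_explicit(flag, 1, memory_order_release). How large spin_limit should be depends on how long the wait usually is, so treat any particular value as something to measure rather than a recommendation.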

nanosleep, if used correctly, should give you finer control than 60us (you might need to switch to a real-time scheduling policy). I suggest you check your code to see whether the arguments are set correctly.
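One way to check what a given request really costs is to time the call yourself. This is a small sketch that wraps nanosleep between two CLOCK_MONOTONIC readings; the function name measure_nanosleep_ns is made up, and the measurement includes the overhead of the clock calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Sleeps for request_ns (must be below one second) via nanosleep() and
 * returns the elapsed time in nanoseconds as seen by CLOCK_MONOTONIC,
 * or -1 if any of the calls fails (for example on EINTR). */
long long measure_nanosleep_ns(long request_ns)
{
    struct timespec req = { .tv_sec = 0, .tv_nsec = request_ns };
    struct timespec t0, t1;

    assert(request_ns >= 0 && request_ns < 1000000000L);
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0) return -1;
    if (nanosleep(&req, NULL) != 0) return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0) return -1;

    return (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (t1.tv_nsec - t0.tv_nsec);
}
```

Calling measure_nanosleep_ns(100) in a loop and printing the results shows the real cost of a 100 ns request on your machine. A single sample can be noisy, so look at many.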

--Edit--

The accuracy of nanosleep depends on the kernel. For very short sleeps, older kernels simply busy-looped for real-time processes (see the first reference). On a kernel without high-resolution timers it is also impossible to yield to the scheduler for an interval (say, a few nanoseconds) shorter than a scheduler tick (set by CONFIG_HZ, typically 250 or 1000), because the scheduler only context-switches when the timer fires.
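One more kernel-side factor worth ruling out on Linux is per-thread timer slack, which lets the kernel delay timer expirations for threads under a normal (non-real-time) scheduling policy. The default is 50 microseconds, and it is not applied to real-time threads. Whether this accounts for the delay on your box is an assumption you would have to verify by measuring. A sketch of reading and lowering it with prctl follows; the wrapper names are made up for illustration.

```c
#include <assert.h>
#ifdef __linux__
#include <sys/prctl.h>
#endif

/* Sets the calling thread's timer slack in nanoseconds. ns must be non-zero,
 * because 0 resets the slack to the thread's default.
 * Returns 0 on success, -1 on failure or on non-Linux targets. */
int set_timer_slack_ns(unsigned long ns)
{
    assert(ns > 0);
#if defined(__linux__) && defined(PR_SET_TIMERSLACK)
    return prctl(PR_SET_TIMERSLACK, ns, 0, 0, 0) == 0 ? 0 : -1;
#else
    return -1;
#endif
}

/* Returns the calling thread's current timer slack in nanoseconds,
 * or -1 if it cannot be read. */
long get_timer_slack_ns(void)
{
#if defined(__linux__) && defined(PR_GET_TIMERSLACK)
    return (long)prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);
#else
    return -1;
#endif
}
```

Timing a short nanosleep before and after set_timer_slack_ns(1) is a quick experiment to see how much of the delay the slack explains on a given machine.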

Also, just idling the CPU for a few nanoseconds won't actually save power. CPU power is saved through either C-states or P-states: P-states use frequency scaling, while C-states shut down components of the CPU. There is a halt instruction that can trigger such a state transition, but the transition takes time (latency in the microsecond range), which makes it expensive.

Reference:

http://tldp.org/HOWTO/IO-Port-Programming-4.html

http://ena-hpc.org/2014/pdf/paper_06.pdf

Whenever answered 6/5, 2016 at 6:0 Comment(4)

The OP said 60us, not 60ms. – Guss 6/5, 2016 at 19:58

Sorry, I read it wrong. Thanks for bringing it up; I will make an edit. – Whenever 6/5, 2016 at 21:28

Thanks. I tried to nanosleep for less than 1 microsecond, but nanosleep slept for 60 microseconds; see my edit. What a surprise. – Avalanche 8/5, 2016 at 23:18

@Avalanche Yeah, it's the same on mine as well. There have been plenty of precision measurements from folks trying to use it for accurate timing, and 60us seems to be common unless you use the TSC directly. – Whenever 9/5, 2016 at 2:33

I think an easy solution (faster than nanosleep) is to use multiple pause instructions.

Also, please note the following caveat, from Intel's Benefitting Power and Performance Sleep Loops:

It is important to note that the number of cycles delayed by the pause instruction may vary from one processor family to another. You should avoid using multiple pause instructions, assuming you will introduce a delay of a specific cycle count.
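Given that caveat, one option is to bound the spin by a clock instead of by a fixed number of pause instructions. Here is a rough sketch; the names now_ns, pause_hint and spin_delay_ns are my own, and the no-op fallback on non-x86 targets is an assumption. Note that this still keeps the core fully busy for the whole delay, so it trades CPU time for latency.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define pause_hint() _mm_pause()
#else
#define pause_hint() ((void)0)  /* assumption: no pause hint on other targets */
#endif

/* Current CLOCK_MONOTONIC time in nanoseconds (error handling omitted). */
long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Busy-waits for at least delay_ns, issuing one pause per iteration.
 * The loop is bounded by the clock, not by a pause count, because the
 * length of a pause differs between processor families.
 * Returns the number of loop iterations executed. */
long spin_delay_ns(long long delay_ns)
{
    long long deadline;
    long iters = 0;

    assert(delay_ns >= 0);
    deadline = now_ns() + delay_ns;
    while (now_ns() < deadline) {
        pause_hint();
        iters++;
    }
    return iters;
}
```

For example, spin_delay_ns(2000) waits roughly two microseconds, plus the cost of the clock reads.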

Gullah answered 16/5, 2019 at 2:31 Comment(2)

Your URL looks like a copy/paste of the quote body. – Guss 17/10, 2019 at 4:0

This is the actual link. It's from Intel and a nice read: Benefitting Power and Performance Sleep Loops. Or the plain URL: software.intel.com/en-us/articles/… – Unlikelihood 24/4, 2020 at 14:26
