Imagine I want to have one main thread and a helper thread run as the two hyperthreads on the same physical core (probably by forcing their affinity to approximately ensure this).
The main thread will be doing important high-IPC, CPU-bound work. The helper thread should do nothing other than periodically update a shared timestamp value that the main thread will periodically read. The update frequency is configurable, but could be as fast as 100 MHz or more. Such fast updates more or less rule out a sleep-based approach, since blocking sleeps are too slow to sleep/wake on a 10 nanosecond (100 MHz) period.
So I want a busy wait. However, the busy wait should be as friendly as possible to the main thread: use as few execution resources as possible, and so add as little overhead as possible to the main thread.
I guess the idea would be a long-latency instruction that doesn't use many resources, like `pause`, and that also has a fixed-and-known latency. That would let us calibrate the "sleep" period so no clock read is even needed (if we want to update with period P and the instruction's latency is L, we just issue P/L of these instructions for a calibrated busy-sleep). Well, `pause` doesn't meet that latter criterion, as its latency varies a lot¹.
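The P/L calibration described above can be sketched as follows. This is only an illustration under stated assumptions: `delay_count`, `delay_once`, `publish_one_period` and `shared_timestamp_ns` are hypothetical names, the core clock is assumed constant (no frequency scaling), and `delay_once` is an empty placeholder (GCC/Clang syntax) where the chosen fixed-latency instruction would go.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Timestamp published by the helper thread and read by the main thread. */
_Atomic uint64_t shared_timestamp_ns;

/* Number of delay instructions to issue per period: roughly P / L.
 * period_ns      : desired update period P
 * latency_cycles : ASSUMED fixed latency L of the delay instruction
 * core_mhz       : ASSUMED constant core clock in MHz
 * Rounds to nearest and never returns 0. */
uint64_t delay_count(uint64_t period_ns, uint64_t latency_cycles, uint64_t core_mhz)
{
    assert(latency_cycles > 0);
    uint64_t period_cycles = period_ns * core_mhz / 1000; /* ns * (cycles per us) / 1000 */
    uint64_t n = (period_cycles + latency_cycles / 2) / latency_cycles;
    return n ? n : 1;
}

/* Placeholder for the fixed-latency instruction (assumption: substitute
 * the real instruction here). As written it is only a compiler barrier. */
static inline void delay_once(void)
{
    __asm__ volatile("" ::: "memory");
}

/* One helper-thread iteration: burn one period, then advance the shared
 * timestamp by the nominal period without reading any clock. */
void publish_one_period(uint64_t n, uint64_t period_ns)
{
    for (uint64_t i = 0; i < n; i++)
        delay_once();
    /* Single writer, so a relaxed load + store is enough (no locked RMW). */
    uint64_t now = atomic_load_explicit(&shared_timestamp_ns, memory_order_relaxed);
    atomic_store_explicit(&shared_timestamp_ns, now + period_ns, memory_order_relaxed);
}
```

The helper thread would compute `n = delay_count(P, L, f)` once and then call `publish_one_period(n, P)` in an endless loop. Any drift between the assumed and real latency accumulates, which is why the latency has to be both fixed and known for this to work.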
A second option would be to use a long-latency instruction even if the latency is unknown, and after every instruction do a `rdtsc` or some other clock reading method (`clock_gettime`, etc.) to see how long we actually slept. Seems like it might slow down the main thread a lot though.
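A minimal sketch of this second option, assuming GCC or Clang. The names `read_clock`, `long_latency_insn` and `busy_sleep` are hypothetical. On x86 it uses `pause` as the delay instruction and `__rdtsc()` as the clock; on other targets it falls back to `clock_gettime` purely so the sketch still compiles, which also changes the unit from TSC ticks to nanoseconds.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h> /* __rdtsc(), _mm_pause() */
#endif

/* Read an increasing counter: TSC ticks on x86, nanoseconds elsewhere. */
static inline uint64_t read_clock(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* The long-latency instruction of unknown duration (pause on x86). */
static inline void long_latency_insn(void)
{
#if defined(__x86_64__) || defined(__i386__)
    _mm_pause();
#else
    __asm__ volatile("" ::: "memory"); /* placeholder on other targets */
#endif
}

/* Spin until at least `ticks` clock units have passed.
 * Returns the number of units that actually elapsed. */
uint64_t busy_sleep(uint64_t ticks)
{
    assert(ticks > 0); /* a zero-length sleep is a caller bug */
    uint64_t start = read_clock();
    uint64_t elapsed;
    do {
        long_latency_insn();
        elapsed = read_clock() - start; /* one clock read per delay instruction */
    } while (elapsed < ticks);
    return elapsed;
}
```

The clock read after every delay instruction is exactly the overhead the question worries about, so the cost to the sibling thread would have to be measured rather than assumed.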
Any better options?
¹ Also, `pause` has some specific semantics around preventing speculative memory accesses which may or may not be beneficial to this sibling thread scenario, since I'm not in a spin-wait loop really.
[...] `clock_gettime`? Do you think the former is twice as fast? Five times? I guess I appreciate your apparent concern that I might waste my effort, but I'm OK with experimentation. – Head
`fdiv` and sqrt definitely stuck out as having long latency with only one uop (versus, for example, integer division, which has a ton of uops). It wasn't clear to me if this blocks the port (p0 on Skylake) the whole time or only the FP unit (or some sub-part), as you suggest. I'm not going to be doing any FP on the main thread, but integer AVX is possible. Great suggestion though, I'll test it. – Head
[...] `movd`'s is only slowed down slightly by the occasional `sqrtsd`. – Atkinson
[...] `rdtsc` and write that as your "real time" value. The reverse case works similarly (with a `rdpmc` call, for example, to read cycles). @PeterCordes – Head
[...] `rdtsc` or `rdpmc` or whatever call to read the underlying entirely and just calculate it directly. For example, if you know `fsqrt` always takes 33 cycles, and you want to measure `cycles`, you just increment the cycles variable by 33 (or 33 + N for the increment) in between each `fsqrt`. So that's a useful feature, but the main purpose of the sleep is to be friendly to the other core. – Head
`movnti` store / reload would use up memory resources instead of ALU. That's definitely going to be variable latency, so only usable with `rdtsc` (not calibration), but will sleep for ~500 cycles, so it's pretty light-weight. – Asphodel
`mfence` is 3 uops, 33c throughput on Haswell. Maybe even more friendly than spinning on `pause` on pre-Skylake. `rdrand` is 16 uops, one per ~460c throughput on Skylake. – Asphodel
`pause` is 4 uops for a 2 byte instruction, so it will easily fill up the uop cache lines for an aligned loop with much unrolling. That's maybe ok, as long as there's no penalty for one thread running from the legacy decoders while the other runs from the uop cache. I assume not. – Asphodel
[...] `pause` really seems close to ideal. Pre-Skylake it's less clear. – Head
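The idea from the comments of skipping the clock read and simply adding a known latency to a counter after each square root could look roughly like this. It is a sketch under assumptions, not a measured implementation: `sqrt_step`, `run_ticks` and `shared_cycles` are hypothetical names, the x86-64 path uses `sqrtsd` via GCC/Clang inline asm rather than x87 `fsqrt`, and the per-instruction latency is passed in by the caller because it has to be measured on the target CPU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Cycle count published by the helper thread for the main thread to read. */
_Atomic uint64_t shared_cycles;

/* One link in a dependent square-root chain. Each call consumes the
 * previous result, so successive square roots cannot overlap. */
static inline double sqrt_step(double x)
{
#if defined(__x86_64__)
    __asm__ volatile("sqrtsd %0, %0" : "+x"(x));
#else
    /* Placeholder so the sketch compiles elsewhere; not a real delay. */
    x = x * 0.5 + 0.5;
    __asm__ volatile("" ::: "memory");
#endif
    return x;
}

/* Run `iterations` square roots, adding the ASSUMED latency to the
 * published counter after each one. Returns the final counter value. */
uint64_t run_ticks(uint64_t iterations, uint64_t latency_cycles)
{
    assert(latency_cycles > 0);
    double x = 2.0; /* repeated sqrt converges toward 1.0 */
    uint64_t cycles = 0;
    for (uint64_t i = 0; i < iterations; i++) {
        x = sqrt_step(x);
        cycles += latency_cycles; /* computed, never read from a clock */
        atomic_store_explicit(&shared_cycles, cycles, memory_order_relaxed);
    }
    return cycles;
}
```

For example, `run_ticks(n, 33)` would use the 33-cycle figure quoted above for `fsqrt`; the right value for `sqrtsd` on a given microarchitecture is an open assumption here, and loop overhead (the "+ N" mentioned in the comment) would need to be folded into `latency_cycles` as well.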