Creating a friendly timed busy loop for a hyperthread

Imagine I want to have one main thread and a helper thread run as the two hyperthreads on the same physical core (probably by forcing their affinity to approximately ensure this).

The main thread will be doing important high-IPC, CPU-bound work. The helper thread should do nothing other than periodically update a shared timestamp value that the main thread will periodically read. The update frequency is configurable, but could be as fast as 100 MHz or more. Such fast updates more or less rule out a sleep-based approach, since blocking sleeps are far too slow to sleep/wake on a 10-nanosecond (100 MHz) period.

So I want a busy wait. However, the busy wait should be as friendly as possible to the main thread: use as few execution resources as possible, and so add as little overhead as possible to the main thread.

I guess the idea would be a long-latency instruction that doesn't use many execution resources, like pause, and that also has a fixed and known latency. That would let us calibrate the "sleep" period so that no clock read is even needed (if we want to update with period P, we just issue P/L instructions of latency L for a calibrated busy-sleep). Unfortunately, pause doesn't meet that latter criterion, as its latency varies a lot¹.
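A strawman of what I have in mind, assuming (purely hypothetically) that a dependent chain of sqrtsd has a fixed, calibratable latency; both constants below are made up and would need per-machine calibration:

    #include <stdint.h>
    #include <stdatomic.h>

    #define PERIOD_CYCLES  40    // desired update period P, in cycles
    #define LATENCY_CYCLES 18    // assumed per-sqrtsd latency L (calibrate!)

    _Atomic uint64_t shared_stamp;   // read by the main hyperthread

    void helper_loop(void) {
        double x = 1.0;
        uint64_t stamp = 0;
        for (;;) {
            // dependent chain: each sqrtsd waits for the previous result
            for (int i = 0; i < PERIOD_CYCLES / LATENCY_CYCLES; i++)
                __asm__ volatile("sqrtsd %0, %0" : "+x"(x));
            stamp += PERIOD_CYCLES;  // no clock read needed if L is truly fixed
            atomic_store_explicit(&shared_stamp, stamp, memory_order_relaxed);
        }
    }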

A second option would be to use a long-latency instruction even if the latency is unknown, and after every instruction do an rdtsc or some other clock read (clock_gettime, etc.) to see how long we actually slept. That seems like it might slow down the main thread a lot, though.
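For concreteness, the second option might be sketched like this, with _mm_pause standing in for whatever long-latency instruction is chosen (the structure, not the particular instruction, is the point here):

    #include <stdint.h>
    #include <stdatomic.h>
    #include <x86intrin.h>   // __rdtsc, _mm_pause

    _Atomic uint64_t shared_stamp;

    void helper_loop(uint64_t period_cycles) {
        uint64_t next = __rdtsc() + period_cycles;
        for (;;) {
            _mm_pause();                   // the long-latency "sleep" step
            uint64_t now = __rdtsc();      // see how long we actually slept
            if (now >= next) {
                atomic_store_explicit(&shared_stamp, now, memory_order_relaxed);
                next += period_cycles;
            }
        }
    }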

Any better options?


¹ Also, pause has some specific semantics around preventing speculative memory accesses, which may or may not be beneficial in this sibling-thread scenario, since I'm not really in a spin-wait loop.

Head answered 7/8, 2017 at 22:40
Seems like a lot of effort for little or no value. Your main thread needs to read the timestamp - why bother with a "helper thread"? Just read the CPU time or whatever you need in your main thread when/where you actually need it... It's not supposed to be slower than reading it from "a shared timestamp"...Sapota
@Sapota - why don't you tell me how long a single L1-resident read from the main thread takes versus say clock_gettime? Do you think the former is twice as fast? Five times? I guess I appreciate your apparent concern that I might waste my effort, but I'm OK with experimentation.Head
Square roots maybe? Single µop, huge latency, but would burn some power and affect square roots done by the main thread (does it do any?). And I wonder how intentionally mispredicted branches would act.Atkinson
@Atkinson - thanks! I had scrolled through Agner's instruction timings doc and the "heavy" x87 FP instructions like fdiv and sqrt definitely stuck out as having long latency with only one uop (versus for example integer division which has a ton of uops). It wasn't clear to me if this blocks the port (p0 on Skylake) the whole time or only the FP unit (or some sub-part) as you suggest? I'm not going to be doing any FP on the main thread - but integer AVX is possible. Great suggestion though, I'll test it.Head
I'm reasonably sure that it's only the FU that is blocked (if it's not pipelined) and the port is only used for issue/writeback. So eg a bunch of movd's is only slowed down slightly by the occasional sqrtsd.Atkinson
Experimentation is fine, and just having fun playing around is great, but for practical reasons as well as portability and compatibility - it's something I'd avoid. To answer your question, I have no idea what the timings are, but I'm willing to bet (quite heavily) it's negligible compared to anything else you do in your CPU-intensive task. If you have a tight inner loop, wrap it with a slower loop (1 : 10000 for example) and do your lookup there. That's my 2 cents anyway...Sapota
I'm even willing to bet the costs of synchronization of the shared variable between the two threads will be greater than reading the value in the first place.Broder
@Broder - it could be, but it is different than my understanding (when it comes to hyperthreaded siblings). It's important enough that I've asked a separate question here exactly covering that topic.Head
As it turns out, this idea isn't too crazy at all - although the approach there is somewhat inverted (the helper thread reads the counters and notes the IP of the primary thread, rather than my approach of having the helper thread write counters that the main thread reads).Head
Is the sleep period supposed to be constant-time or constant clock cycles? Because the max turbo frequency depends on a lot of things other than just what this core is doing; e.g. how many other cores are active, as well as thermal and power conditions.Asphodel
Either is OK and both are useful in slightly different scenarios (i.e., measuring cycles vs real time), but ultimately one can be converted into the other, since the sleep doesn't necessarily have to relate to the underlying metric. For example, perhaps you want to measure real time, but the sleep happens to be a fixed number of cycles: when you wake from sleep, you just issue a rdtsc and write that as your "real time" value. The reverse case works similarly (with a rdpmc call, for example, to read cycles). @PeterCordesHead
However, in the case that the "sleep" unit happens to line up with the underlying metric you want to measure on the primary thread and it has a predictable latency, you can perhaps avoid the rdtsc or rdpmc or whatever call to read the underlying metric entirely and just calculate it directly. For example, if you know fsqrt always takes 33 cycles, and you want to measure cycles, you just increment the cycles variable by 33 (or 33 + N for the increment) in between each fsqrt. So that's a useful feature, but the main purpose of the sleep is to be friendly to the other hyperthread.Head
It would be interesting to play with FP denormals for mul or div, or x87 slowdowns with NaN/Infinity. Those have very high latency, but IDK if the "FP assist" generated to handle that is a lot of microcode uops, or if it's done inside the execution unit. Probably microcode, which makes it not particularly useful for a low-HT-interference thread. A movnti store / reload would use up memory resources instead of ALU. That's definitely going to be variable latency so only usable with rdtsc (not calibration), but will sleep for ~500 cycles, so it's pretty light-weight.Asphodel
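A quick sketch of that movnti store/reload idea (the names nt_buf and nt_sleep are made up, and the ~500-cycle stall is the untested assumption from the comment above):

    #include <stdint.h>
    #include <x86intrin.h>   // _mm_stream_si64

    static long long nt_buf;   // private scratch location for the helper thread

    static inline void nt_sleep(void) {
        _mm_stream_si64(&nt_buf, 0);             // movnti: non-temporal store
        __asm__ volatile("" ::: "memory");       // keep the reload after the store
        (void)*(volatile long long *)&nt_buf;    // reload waits for the NT store
    }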
mfence is 3 uops, 33c throughput on Haswell. Maybe even more friendly than spinning on pause on pre-Skylake. rdrand is 16 uops, one per ~460c throughput on Skylake.Asphodel
Actually, with competition from the "main" thread, I don't think you could just calibrate a loop unless you're willing to accept some significant variability. You'd want to unroll it several times to reduce uops. pause is 4 uops for a 2 byte instruction, so it will easily fill up the uop cache lines for an aligned loop with much unrolling. That's maybe ok, as long as there's no penalty for one thread running from the legacy decoders while the other runs from the uop cache. I assume not.Asphodel
@PeterCordes - significant variability is more or less OK, especially if it is uncorrelated with what the main thread is doing. Imagine it is used for low-overhead method timing: variability is mostly OK if uncorrelated. It seems like at least some types of instructions wouldn't suffer too much competition at least in some scenarios (e.g., floating point instructions used to sleep while the main code is running integer code)... but something like Skylake pause really seems close to ideal. Pre-Skylake it's less clear.Head

Some random musing on the subject.

So you want to have a time stamp at a 100 MHz sample rate; that means on a 4 GHz CPU you have 40 cycles between each update.

The timer thread busily reads the time stamp counter (rdtsc), but can't use the safe, serialized variant with cpuid, as that takes 100 cycles. The plain rdtsc has a latency of around 25 cycles (and a throughput of one per 25 cycles); there is a slightly newer, slightly more accurate variant (rdtscp) with slightly more latency (32 cycles).

    #include <stdint.h>
    #include <stdatomic.h>
    #include <x86intrin.h>                    // __rdtsc
    extern _Atomic uint64_t sample;           // shared; give it its own cache line
    void timer_loop(uint64_t sample_cycles) {
        uint64_t last = __rdtsc();
        for (;;) {
            uint64_t now = __rdtsc();         // read time (~25 cycles)
            if (now - last < sample_cycles)   // tmp = time - last; compare
                continue;                     // goto start
            last += sample_cycles;            // last += cycles between samples
            atomic_store_explicit(&sample, now, memory_order_relaxed);  // sample = time
        }
    }

In a perfect world the branch predictor will guess right every time; in reality it will mispredict randomly, adding 5-14 cycles to the loop's 26 cycles due to variance in the read-time cycles.

When the sample is written, the other thread will have its in-flight instructions cancelled, starting from the first speculative load from this cache line (remember to align the sample position to 64 bytes so no other data is affected). The load of the sample time stamp then starts over, after a delay of ~5-14 cycles depending on where the instructions come from: the loop buffer, the micro-op cache, or the I-cache.

So a minimum of 5-14 cycles out of every 40 cycles of performance will be lost, in addition to half the CPU being used by the other thread.

On the other hand, reading the time stamp counter directly in the main thread would cost ...

~1/4 cycle of throughput, with the latency most likely hidden by other instructions. But then you can't vary the frequency, and the long 25-cycle latency could be a problem unless some other long-latency instructions precede it.

Using an atomic read-modify-write instruction (xchg, or lock cmpxchg for a true CAS) to publish the sample might partly solve the problem, as the reader's loads then shouldn't cause a re-issue of instructions, but would instead result in a delay on all following reads and writes.
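A minimal sketch of that variant using C11 atomics (atomic_exchange compiles to an implicitly locked xchg on x86); whether this actually avoids re-issues on the reader side is an assumption to benchmark, and publish_sample is a made-up helper name:

    #include <stdint.h>
    #include <stdatomic.h>

    _Alignas(64) _Atomic uint64_t sample;   // own cache line, as noted above

    static inline void publish_sample(uint64_t now) {
        // a locked RMW instead of a plain store to publish the time stamp
        atomic_exchange_explicit(&sample, now, memory_order_seq_cst);
    }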

Hy answered 17/8, 2017 at 22:7
Hi Surt. What are you referring to by "the old real time clock"? It would be very nice if there were any hardware clock with a latency of 6 cycles, or are you referring to something like clock() that is updated by the OS? About rdtsc - the cpuid probably isn't strictly necessary here: it's mostly useful when you are issuing that instruction "inline" with the code being measured, in order to prevent reordering into/out of the timed region, but if you are just issuing a chain of rdtsc on a helper thread I expect to get reasonable values (and such could be tested).Head
There are 2 versions of the RTDSC(sp?) instruction.Hy
Yes, rdtsc and the newer rdtscp. The latter includes more synchronization and also reads the value of an MSR at the same time (usually this has the CPU number).Head
... but none of them take close to 6 cycles (at least back-to-back) on most modern x86 archs (a few exotic ones like some Xeon Phi models excluded).Head
I see my memory was a bit dated :) with 25 and 32 cycles respectively on an Intel Skylake (according to Agner Fog).Hy
