Reliability of Xcode Instrument's disassembly time profiling
Asked Answered
I

2

11

I've profiled my code using Instrument's time profiler, and zooming in to the disassembly, here's a snippet of its results:

Instrument Screenshot

I wouldn't expect a mov instruction to take 23.3% of the time while a div instruction to take virtually nothing. This causes me to believe these results are unreliable. Is this true and known? Or am I just experiencing an Instruments bug? Or is there some option I need to use to obtain reliable results?

Is there any reference expanding on this issue?

Isoagglutinin answered 21/1, 2018 at 16:58 Comment(10)
It seems very likely that instruments is experiencing "skid" - i.e., the time from an expensive instructions tends to be spread over subsequent instructions rather than allocated correctly to the slow one. This can be mostly mitigated using Intel's PEBS (precise samping). If you share the entire loop in assembly along with "typical" input to drive it, I can run some tests locally to see if I see similar results with various PEBS settings.Weissman
Also, divps itself doesn't get any counts because it doesn't have to wait for its input. The counts apply to instructions that have to wait for the slow divss result. (But that's not the full explanation; looks like it should bottleneck on divss throughput unless you're on Skylake. And there are lots of counts for instructions later in the chain, not concentrated on the first instruction using the divss result.)Diaconal
@BeeOnRope: Andy Glew posted an interesting answer that explains the internals of perf-counter interrupts in the Intel P6 microarchitecture's pipeline, and why (before PEBS) they were always delayed.Diaconal
@PeterCordes - yeah I read it recently. To be clear, even today the non-PEBS way of sampling is inherently imprecise since it still relies on interrupts and IP examination: even if the interrupt is signaled synchronously by the instruction that cause the PMU counter to overflow (if such a concept is even well-defined for the event), the pipeline is usually going to be in a state with many instructions in flight, retired and not, executed and not, etc, etc. When the interrupt happens all the speculative stuff gets thrown away and you are left with the IP pointing to the last retired...Weissman
... instruction prior to the interrupt - but that many have nothing to do with the instruction that triggered the overflow (unless it was was a "retired" type event). That's kind of a limitation of IP-based sampling: interrupts have to put the CPU into a consistent state where you have a single IP at which to continue, but this type of cleanup is inconsistent with sampling many types of events. PEBS writes stuff into an internal buffer rather than use the interrupt mechanism, and then you read it out later, so it avoids all that (that needs special support which is why only some events work).Weissman
Beyond that, even with PEBS, I don't really see how you do a precise "cycles" event. How do you allocate cycles to instructions when so many are in flight at the same time? I guess a reasonable approach is to allocate it to the oldest un-retired instructoin every cycle, but this may still pick instructions that are off the critical path in some cases and so don't actually contribute to execution time (but I don't think that problem is solvable by the PMU anyways). It's... complicated - see also here, for example.Weissman
Thanks folks! Would any of you be interesting in sharing this info in an answer form, so that I can mark it as the accepted answer as well with the "bounty"? :)Isoagglutinin
I have one half-written that I might get around to finishing, showing latency vs. throughput bottlenecks and what kind of distribution of counts for cycles you get in a small loop. Oh, didn't realize the bounty was expiring so soon, yeah I guess I should get on that.Diaconal
@BeeOnRope: I posted an answer here, but I don't look at perf record profiles very often. How much of what I made up is just plain wrong?Diaconal
@PeterCordes - I'll comment on your answer.Weissman
D
4

First of all, it's possible that some counts that really belong to divss are being charged to later instructions, which is called a "skid". (Also see the rest of that comment thread for some more details.) Presumably Xcode is like Linux perf, and uses the fixed cpu_clk_unhalted.thread counter for cycles instead of one of the programmable counters. This is not a "precise" event (PEBS), so skids are possible. As @BeeOnRope points out, you can use a PEBS event that ticks once per cycle (like UOPS_RETIRED < 16) as a PEBS substitute for the fixed cycles counter, removing some of the dependence on interrupt behaviour.

But the way counters fundamentally work for pipelined / out-of-order execution also explains most of what you're seeing. Or it might; you didn't show the complete loop so we can't simulate the code on a simple pipeline model like IACA does, or by hand using hardware guides like http://agner.org/optimize/ and Intel's optimization manual. (And you haven't even specified what microarchitecture you have. I guess it's some member of Intel Sandybridge-family on a Mac).


Counts for cycles are typically charged to the instruction that's waiting for the result, not usually the instruction that's slow to produce the result. Pipelined CPUs don't stall until you try to read a result that isn't ready yet.

Out-of-order execution massively complicates this, but it's still generally true when there's one really slow instruction, like a load that often misses in cache. When the cycles counter overflows (triggering an interrupt), there are many instruction in flight, but only one can be the RIP associated with that performance-counter event. It's also the RIP where execution will resume after the interrupt.

So what happens when an interrupt is raised? See Andy Glew's answer about that, which explains the internals of perf-counter interrupts in the Intel P6 microarchitecture's pipeline, and why (before PEBS) they were always delayed. Sandybridge-family is similar to P6 for this.

I think a reasonable mental model for perf-counter interrupts on Intel CPUs is that it discards any uops that haven't yet been dispatched to an execution unit. But ALU uops that have been dispatched already go through the pipeline to retirement (if there aren't any younger uops that got discarded) instead of being aborted, which makes sense because the maximum extra latency is ~16 cycles for sqrtpd, and flushing the store queue can easily take longer than that. (Pending stores that have already retired can't be rolled back). IDK about loads/stores that haven't retired; at least the loads are probably discarded.

I'm basing this guess on the fact that it's easy to construct loops that don't show any counts for divss when the CPU is sometimes waiting for it to produce its outputs. If it was discarded without retiring, it would be the next instruction when resuming the interrupt, so (other than skids) you'd see lots of counts for it.

Thus, the distribution of cycles counts shows you which instructions spend the most time being the oldest not-yet-dispatched instruction in the scheduler. (Or in case of front-end stalls, which instructions the CPU is stalled trying to fetch / decode / issue). Remember, this usually means it shows you the instructions that are waiting for inputs, not the instructions that are slow to produce them.

(Hmm, this might not be right, and I haven't tested this much. I usually use perf stat to look at overall counts for a whole loop in a microbenchmark, not statistical profiles with perf record. addss and mulss are higher latency than andps, so you'd expect andps to get counts waiting for its xmm5 input if my proposed model was right.)

Anyway, the general problem is, with multiple instructions in flight at once, which one does the HW "blame" when the cycles counter wraps around?


Note that divss is slow to produce the result, but is only a single-uop instruction (unlike integer div which is microcoded on AMD and Intel). If you don't bottleneck on its latency or its not-fully-pipelined throughput, it's not slower than mulss because it can overlap with surrounding code just as well.

(divss / divps is not fully pipelined. On Haswell for example, an independent divps can start every 7 cycles. But each only takes 10-13 cycles to produce its result. All other execution units are fully pipelined; able to start a new operation on independent data every cycle.)

Consider a large loop that bottlenecks on throughput, not latency of any loop-carried dependency, and only needs divss to run once per 20 FP instructions. Using divss by a constant instead of mulss with the reciprocal constant should make (nearly) no difference in performance. (In practice out-of-order scheduling isn't perfect, and longer dependency chains hurt some even when not loop-carried, because they require more instructions to be in flight to hide all that latency and sustain max throughput. i.e. for the out-of-order core to find the instruction-level parallelism.)

Anyway, the point here is that divss is a single uop and it makes sense for it not to get many counts for the cycles event, depending on the surrounding code.


You see the same effect with a cache-miss load: the load itself mostly only gets counts if it has to wait for the registers in the addressing mode, and the first instruction in the dependency chain that uses the loaded data gets a lot of counts.


What your profile result might be telling us:

  • The divss isn't having to wait for its inputs to be ready. (The movaps %xmm3, %xmm5 before the divss sometimes takes some cycles, but the divss never does.)

  • We may come close to bottlenecking on the throughput of divss

  • The dependency chain involving xmm5 after divss is getting some counts. Out-of-order execution has to work to keep multiple independent iterations of that in flight at once.

  • The maxss / movaps loop-carried dependency chain may be a significant bottleneck. (Especially if you're on Skylake where divss throughput is one per 3 clocks, but maxss latency is 4 cycles. And resource conflicts from competition for ports 0 and 1 will delay maxss.)


The high counts for movaps might be due to it following maxss, forming the only loop-carried dependency in the part of the loop you show. So it's plausible that maxss really is slow to produce results. But if it really was a loop-carried dep chain that was the major bottleneck, you'd expect to see lots of counts on maxss itself, as it would be waiting for its input from the last iteration.

But maybe mov-elimination is "special", and all the counts for some reason get charged to movaps? On Ivybridge and later CPUs, register copies doesn't need an execution unit, but instead are handled in the issue/rename stage of the pipeline.

Diaconal answered 11/2, 2018 at 14:19 Comment(7)
One thing I'm not sure about: But ALU uops that have been dispatched already go through the pipeline to retirement instead of being aborted. Why do you think this is the case? Is it a special feature of PMU interrupts? Certainly I don't think normal interrupts work like this: AFAIK all in-flight instructions (i.e., not yet retired) will be tossed, even if they have been executed. Only the store buffer is preserved (since this has post-retirement state). My guess is that PMU interrupts don't work differently. If you wanted to preserve (commit) already executed instructions, ...Weissman
... you'd have to execute a bunch more instructions so that the state at interrupt has a single consistent IP: any not-executed instructions that were older than the youngest executed instruction. That's probably complicated and could be a ton of work (if that's say 100 instructions and they're slow). So I think what you're looking at in the interrupt case is the IP pointing to the oldest unretired instruction when the interrupt occurred. So I'd think the part where you write "Thus, the distribution ..." should most likely say "the oldest not-yet-retired" - but this is without testing.Weissman
@BeeOnRope: I was basing this guess on the fact that divss doesn't get many counts when it doesn't have to wait for its inputs. If it's slow to produce outputs and is discarded without retiring, shouldn't it get lots of counts for cycles? I saw the same thing in experiments on my SKL where there's an independent divss feeding a loop-carried dep chain. Anyway, I was thinking that already-dispatched instructions may be allowed to continue to retirement if there aren't any younger non-executed instructions.Diaconal
That said, this is all about non-PEBS. A good profiler should use PEBS approaches even to sample cycles. perf supports cycles:p and cycles:ppp (I think pp is identical to p) and uses a counter that ticks every cycle, like "UOPS_RETIRED < 16" and PEBS (AFAIK) fills in a separate buffer with details of the event so the interrupt behavior isn't important (the interrupt is only needed to do the reading the data itself comes from the PEBS buffer).Weissman
yeah I'm not totally sure to be honest. On top of what I mentioned there could also be additional skid just in the way the way the interrupt occurs. In the case of a long-latency operation like divss perhaps the interrupt is delayed and when it completes may be it retires immediately if the retirement queue is empty and so never shows up? It could probably be sorted out through some testing - but PEBS makes the interrupt behavior considerably less interesting, so I'm not sure if a deep-dive is warranted.Weissman
@BeeOnRope: Yeah, I think the main thing is to understand that usually you get counts on instructions waiting for inputs, and to understand that stuff like skids are possible. To understand exactly why a certain instruction is being slow, you need to look at more than cycles. (And interesting point about using a PEBS counter to emulate cycles.) Anyway, given that, you can usually figure out what you need to know from a perf record without it being confusing.Diaconal
Peter, when you say "usually you get counts on instructions waiting for inputs", do you mean transitively too? That is, if X is a slow instruction and Y depends on an output of X, and Z depends on an output of Y, do you mean that "usually" you'll see counts on Y (which is waiting for X), or either Y or Z (the latter is waiting on X but only transitively)? The former interpretation doesn't seem to line up with the counts above: the addss which waits on the slow divss is only in 3rd place, while 2nd place mulss is only waiting for a 1 cycle latency andps.Weissman
I
1

Is this true and known?

Yes, it is a known problem with profiling tools on Intel x86. I've observed it (time spent suspiciously assigned to seemingly innocent instructions) both with Linux perf_events and Intel VTune. It has also been reported elsewhere by other people.

A better and more honest visualization of collected results would have summed up all samples inside every basic block, and demonstrated the resulting value associated with a basic block, not its individual instructions. Not 100% fool-proof but a bit better and honest,

Or is there some option I need to use to obtain reliable results?

I do not know if newer profiling hardware, namely tools based on Intel Processor Trace (available starting from Broadwell, but improved in Skylake) instead of older PEBS, would give more accurate data. I guess one needs to experiment with such tools first.

Icterus answered 11/2, 2018 at 12:2 Comment(3)
Intel PT does look like it should be excellent for timing of basic blocks, but it won't give you a breakdown of instructions within a block. By design it only records branches, but it does record a timestamp on events. So it should be great for profiling first-run cold-cache effects, not just steady-state, because you have a log of timestamps at each branch, not just statistical average over all executions of a block.Diaconal
High counts for "fast" instructions isn't always a problem, it's an expected consequence of how the hardware works. So the "problem" is misinterpreting the profiling results. See comments on the question.Diaconal
VTune and perf with cycles:ppp usually give very accurate results in general, unless they are on some old hardware that doesn't support PEBS. Using cycles:p also works well, but has a "deterministic skid" of one so all samples show on the next instruction. That's also fine (as long as you are aware) except in rare cases the instruction with the sample was a target of a jump.Weissman

© 2022 - 2024 — McMap. All rights reserved.