Can the LSD issue uOPs from the next iteration of the detected loop?
I was investigating the capabilities of the branch unit on port 0 of my Haswell, starting with a very simple loop:

BITS 64
GLOBAL _start

SECTION .text

_start:

 mov ecx, 10000000

.loop:

 dec ecx             ;|
 jz .end             ;| 1 uOP (call it D)

 jmp .loop           ;  1 uOP (call it J)

.end:
 mov eax, 60
 xor edi, edi
 syscall

Using perf we see that the loop runs at 1c/iter

Performance counter stats for './main' (50 runs):

        10,001,055      uops_executed_port.port_6   ( +-  0.00% )
         9,999,973      uops_executed_port.port_0   ( +-  0.00% )
        10,015,414      cycles:u                    ( +-  0.02% )
                23      resource_stalls.rs          ( +- 64.05% )

My interpretations of these results are:

  • Both D and J are dispatched in parallel.
  • J has a reciprocal throughput of 1 cycle.
  • Both D and J are dispatched optimally.

However, we can also see that the RS never gets full.
The BE dispatches at most 2 uOPs/c here, while the FE can theoretically issue 4 uOPs/c, so the RS should fill at a net rate of 2 uOPs/c and be full in about 30 c (for an RS with a size of 60 fused-domain entries).
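As a sanity check on that arithmetic, here is a toy model (the 4 uOPs/c issue rate, 2 uOPs/c dispatch rate and 60-entry RS are the figures stated above, not measured values):

```python
# Toy model: RS occupancy if the FE issued 4 uOPs/c while the BE
# dispatched only 2 uOPs/c (rates and RS size as stated in the text).
RS_SIZE = 60       # fused-domain entries
ISSUE_RATE = 4     # uOPs entering the RS per cycle
DISPATCH_RATE = 2  # uOPs leaving the RS per cycle

occupancy, cycles = 0, 0
while occupancy < RS_SIZE:
    occupancy += ISSUE_RATE - DISPATCH_RATE  # net +2 per cycle
    cycles += 1

print(cycles)  # -> 30: the RS would fill in about 30 cycles
```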

To my understanding, there should be very few branch mispredictions and the uOPs should all come from the LSD.
So I looked at the FE:

     8,239,091      lsd.cycles_active ( +-  3.10% )
       989,320      idq.dsb_cycles    ( +- 23.47% )
     2,534,972      idq.mite_cycles   ( +- 15.43% )
         4,929      idq.ms_uops       ( +-  8.30% )

   0.007429733 seconds time elapsed   ( +-  1.79% )

which confirms that the FE is issuing from the LSD¹.
However, the LSD never issues 4 uOPs/c:

     7,591,866      lsd.cycles_active ( +-  3.17% )
             0      lsd.cycles_4_uops

My interpretation is that the LSD cannot issue uOPs from the next iteration², thereby sending only D-J pairs to the BE each cycle.
Is my interpretation correct?


The source code is in this repository.


¹ There is a bit of variance; I think this is due to the high number of iterations, which allows for some context switches.
² This sounds quite complex to do in hardware with limited circuit depth.

Slumberous answered 28/8, 2018 at 9:32 Comment(23)
We know from Is performance reduced when executing loops whose uop count is not a multiple of processor width? that the LSD does issue groups that include the loop-branch uop and the first uops of the next iteration, for loops whose uop count isn't a multiple of 4. It's possible that on first-gen SnB a loop-branch ends an issue group, but we know HSW / SKL isn't like that. Unfortunately my Intel mobo's BIOS-update feature bricked my SnB before that question was posted, so I can't double-check my old results / conclusions, where my testing procedures were based on some assumptions.Highline
It is curious that a large fraction of the cycles are coming from the MITE (legacy decoder) and a fair number from the DSB for this assembly-only program that just does a tight loop. I have seen the effect on Skylake for very small loops, and sometimes it is very erratic (e.g., back-to-back runs the numbers for MITE/LSD/DSB etc. might change wildly). It doesn't seem to correlate with performance. I wonder if the counters are just wrong or there is some other weird effect. I recall it goes away as the loops get longer (then you get close to 100% out of LSD/DSB depending on size).Trott
Since nobody's mentioned it on this question, beware that Skylake / Kaby Lake with up-to-date microcode have their loop-buffer disabled, to fix an erratum. (How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent). It's fixed in Coffee Lake (en.wikichip.org/wiki/intel/microarchitectures/…). On CPUs with a disabled LSD, tiny loops just run from the uop cache (DSB).Highline
Margaret - @PeterCordes already linked this question above, but I wanted to mention one thing from there that isn't interesting enough to get its own answer here. There is an effect along the lines you are asking about here when the LSD is used: it seems that ops from the end and beginning of the loop in the LSD cannot be issued in the same cycle. This would exactly answer your question (with "no") if the "loop in the LSD" was the same as an iteration of the loop, but it seems that usually it isn't: the results on that question are ...Trott
... consistent with the loop body actually appearing "unrolled" several times in the LSD (if it is small enough), and so the "break" where it wraps around and can't issue from the head and the tail only appears every N iterations, where N depends on the size of the loop. A small loop like this will fit many times in the LSD so the effect wouldn't be strong (Hadi's answer seems to explain it, but you could certainly test it by getting rid of one jmp). This explanation also makes sense if you think of the LSD as simply being a re-use of the IDQ: the normal mode of operation of the IDQ ...Trott
... (not using LSD) is that the RAT consumes a number of consecutive uops from the queue, and if there is any wrapping it is always at the same spot (at the ends of the actual full IDQ) which, in hardware, could presumably be wired so there is no hiccup. In LSD mode the IDQ is essentially frozen and the instructions are re-used in place by moving the head pointer around, but it makes sense that when it is time to wrap, the RAT can't read uops from two non-contiguous locations in the IDQ (at least not without complicating the hardware). These are guesses based on current "best guess" LSD behavior.Trott
@Trott Wow! That's deep! So, can I think of the IDQ as a circular buffer with a head index (for the RAT) and a tail index (for the MITE/DSB/MSROM)? When in LSD mode no new uOPs are added at the tail but the head moves around (and of course, only physically consecutive entries can be read, including wrapping around the physical buffer). When not in LSD mode, the RAT moves the head and the "other side" of the FE moves the tail.Slumberous
@MargaretBloom - yes, you can think of it that way. That's exactly how I think of it anyways. I don't have any inside info, but it explains many of the observed behaviors and I haven't seen anything contradicting it (e.g., in patents). A lot of my mental model came from the investigation on that question and it was Peter who pointed out the "unrolling" thing at which point all the numbers fell into place.Trott
You can also kind of sus out some of this info by reading the description of certain hardware performance counters, which talk about LSD, DSB and MITE events. From those events it seems pretty clear that the LSD is not a separate thing that can feed into the IDQ, but it is the IDQ.Trott
@Trott This is a superfluous comment but I want to thank you anyway, because your clear words made me see. And things become simple when you can see them :)Slumberous
@Trott You didn't need to sus that out by yourself. At the end of the page immediately preceding the one I linked to regarding the BOB, there is this quote: One difference in Haswell’s decoding path is the uop queue, which receives uops from the decoders or uop cache and also functions as a loop cache.Alpenglow
@IwillnotexistIdonotexist - good catch! David's articles are a really great source for various bits of knowledge even if you read them already several times :).Trott
Thanks @MargaretBloom! The superfluous comment is appreciated.Trott
@Trott how does the "unrolling" effect in the LSD fit with the mental model of replaying the LSD == moving the head pointer in the IDQ?Polychasium
@Polychasium - this question and answers explain it better than I can in a comment, but basically it fits. Actually, the way unrolling affects performance was one of the main reasons unrolling was hypothesized in the first place.Trott
Basically the LSD behaves as if there are N unrolled copies (with N consistent based on the known LSD sizes) in the IDQ, and where up to 4 uops can be renamed per cycle even across iterations, as long as they are contiguous in the IDQ, but not across the "last and first" iterations, which makes sense if you think about how the hardware may work.Trott
That is, if you have a loop with 3 uops and it is unrolled 5 times (in reality it would be unrolled many more times, up to close to the max size of the LSD), the uops would be like: AAA BBB CCC DDD EEE in the LSD. They would be renamed like: AAAB BBCC CDDD EEE, i.e. 4 cycles for 5 iterations. Note the last group has only 3 uops, not 4, since EEE can't be combined with the A (last and first group).Trott
Throughput is 3.75 uops, less than the max of 4, but better than the scenario if there was no unrolling in which case only 1 iteration (3 uops) could be renamed every cycle. If these same uops came out of the DSB, however, they'd run at 4/cycle (barring any other bottlenecks and assuming "suitable" DSB packing).Trott
@Trott that makes sense, but what I don't get (again regarding the mental model) is the following. I assume the IDQ head is being moved with the following function: Loop_Start + (Loop_Uop_Number % Loop_Size). I don't see where sizeof(LSD) fits in that function / model, so I don't really see why AAAB would be allowed but EEEA wouldn't. What makes sense is that the decoded uops are in the LSD and the LSD has its own head which can't send non-contiguous uops to the RAT. (Possibly the misunderstanding is that I am treating IDQ as a proper noun and you are treating it as a description of any decoded uops?)Polychasium
@Noah, it's not the size of the LSD that matters, it's whether the number of uops is a multiple of 4. If the locked uops are a multiple of 4, there is no "wraparound loss" since the wraparound point falls at a rename boundary anyway. Also, I think it's plausible that LSD-locked loops always start at entry 0 in the LSD, so you don't need the Loop_Start term. This would seem to simplify the hot-path hardware. I don't know if it's easy to distinguish the two cases.Trott
Think of the IDQ as a 64 entry hardware array. Then imagine the rename hardware can, in one cycle, take 4 consecutive uops from this array. E.g. uops 0-3 or 60-63. Maybe they need to be aligned, or maybe eg 2-5 is allowed too (doesn't matter for this experiment). Maybe the buffer physically "wraps around" so that 63 and 0 are adjacent or maybe it doesn't, again doesn't matter. In the DSB case, the head is constantly rotating around the whole buffer in steps of 4, at least when FE and execution can keep up: the boundary always falls between a rename group regardless of the looping structure.Trott
In the LSD case, however, it doesn't loop around the whole buffer (usually) but has to wrap early. In the example above, AAAB are in entries 0-3 so they can be processed in one cycle as usual. EEEA, however, are in entries 12-14 and 0. So they are not consecutive in any sense, and it would require a lot more hardware to allow two arbitrary runs of entries to be delivered to the RAT in the same cycle. Even if AAAA didn't start at 0, you run into a similar issue at the "wrap" point.Trott
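The wrap-around model described in these comments can be sketched numerically. This is a toy model, not a measurement: the 3-uop loop and the 5× unroll are the illustrative figures from the AAA BBB CCC DDD EEE example above.

```python
# Toy model of LSD renaming with a forced break at the wrap point.
# A 3-uop loop "unrolled" 5 times fills IDQ entries 0..14; the
# renamer takes up to 4 *consecutive* entries per cycle but cannot
# cross the wrap from entry 14 back to entry 0.
LOOP_UOPS, UNROLL = 3, 5
total = LOOP_UOPS * UNROLL   # 15 locked entries

cycles = -(-total // 4)      # ceil(15/4) = 4 rename cycles (AAAB BBCC CDDD EEE)
print(total / cycles)        # -> 3.75 uops/cycle on average, not 4
```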
One question is why Intel doesn't choose a smarter unroll factor, rather than just the largest possible unroll. The waste decreases on average with larger unrolls but often the largest is not the best. E.g. if you unroll by 4 any time you can you get zero wraparound waste. Maybe it's hard to implement or something.Trott
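The point that the largest unroll is not always the best can be checked numerically under the same toy model. The 56-entry LSD capacity here is an assumption for illustration, not a figure from the thread:

```python
# Compare unroll factors for a 3-uop loop in a hypothetical 56-entry
# LSD. Renaming runs at up to 4 consecutive uops/cycle with a forced
# break at the wrap, so throughput = uops / ceil(uops / 4).
LOOP_UOPS, LSD_SIZE = 3, 56

def throughput(unroll):
    uops = LOOP_UOPS * unroll
    return uops / -(-uops // 4)        # ceil division

largest = LSD_SIZE // LOOP_UOPS        # 18 copies: the "max unroll" choice
best = max(range(1, largest + 1), key=throughput)

print(best, throughput(best))          # unroll by 4 -> a full 4.0 uops/c
print(largest, throughput(largest))    # max unroll  -> only ~3.86 uops/c
```

Any unroll that makes the locked uop count a multiple of 4 gives zero wraparound waste, matching the comment's "unroll by 4" observation.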

All of the uops in your loop are branches (2 per iteration). I think the reason that lsd.cycles_4_uops is zero is a limitation in the renamer. According to the Intel Optimization Manual, Section 2.4.3.1:

The renamer can allocate two branches each cycle, compared to one branch each cycle in the previous microarchitecture. This can eliminate some bubbles in execution.

That is a subsection of a section on the Sandy Bridge microarchitecture, but to my knowledge this applies to all later microarchitectures. The maximum renaming throughput is 4 uops per cycle, but at most two of these can be branches. So in this example, where all uops are branches, the LSD can never deliver more than 2 uops in any given cycle, even in the first iteration of the loop.
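A minimal sketch of that limit (toy model; the 2-branch-per-cycle cap is from the manual quote above, and the iteration count is the one used in the question):

```python
# Toy model: the renamer is 4 uops wide but can allocate at most 2
# branch uops per cycle. In this loop every uop is a branch (D and J),
# so the branch cap, not the 4-wide rename width, sets the pace.
RENAME_WIDTH, BRANCH_CAP = 4, 2
ITERATIONS = 10_000_000
branch_uops = 2 * ITERATIONS             # D and J per iteration

per_cycle = min(RENAME_WIDTH, BRANCH_CAP)
print(branch_uops // per_cycle)          # -> 10,000,000 cycles: 1 c/iter
```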

Therefore, 2 branch uops are allocated into the RS per cycle, and both (one predicted taken and one predicted not taken) can be dispatched per cycle, so the RS occupancy does not grow.

This limitation does not impact the performance of your program. Executing 2 branch uops per cycle, giving an IPC of 3 (three instructions per iteration, one iteration per cycle), is already optimal.

I tried to find a performance event that can capture allocator stalls due to that limitation. The events RESOURCE_STALLS.ANY and UOPS_ISSUED.ANY (with cmask=1 and inv=1) don't seem to be relevant in this case. @IwillnotexistIdonotexist suggested to use IDQ_UOPS_NOT_DELIVERED.CORE. I present the results below for the performance event and all of its supported variants. I also provide the correct meaning of these events because the manual is wrong. T denotes the number of iterations.

IDQ_UOPS_NOT_DELIVERED.CORE: Counts the number of slots that were not utilized by the allocator. If the program ran for C core cycles, then the total number of slots is 4*C. The measured value is almost equal to 2*T. Since the number of cycles is about T, the number of slots is about 4*T, which means that about half the issue slots were not utilized.

IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE: Counts the number of cycles where zero uops were delivered from the IDQ. The measured value is negligible.

IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE: Counts the number of cycles where at most 1 uop was delivered from the IDQ. The measured value is negligible.

IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE: Counts the number of cycles where at most 2 uops were delivered from the IDQ. The measured value is almost equal to T.

IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE: Counts the number of cycles where at most 3 uops were delivered from the IDQ. The measured value is almost equal to T.

Therefore, since the execution time is almost equal to T core cycles, we can conclude that the allocator allocates exactly 2 uops per cycle in most cycles, which is equal to the dispatch rate.
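The slot accounting behind these numbers can be sketched as follows (T is the iteration count from the question; the 4-slots-per-cycle and 2-uops-allocated figures follow the event descriptions above):

```python
# Issue-slot accounting for T iterations at ~1 cycle/iteration:
# each cycle offers 4 issue slots and the allocator uses only 2.
T = 10_000_000              # iterations; cycle count C is ~T here
slots = 4 * T               # total issue slots over the run
delivered = 2 * T           # 2 uops allocated per cycle

print(slots - delivered)    # -> 2*T unused slots, matching the
                            #    measured IDQ_UOPS_NOT_DELIVERED.CORE
```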

Note that the RS in Haswell and Skylake holds unfused uops. So each entry can hold a single unfused uop. See Footnote 2. But this doesn't matter here because there is no microfusion.

Galantine answered 28/8, 2018 at 22:44 Comment(8)
Perhaps idq_uops_not_delivered.core?Alpenglow
@IwillnotexistIdonotexist IDQ_UOPS_NOT_DELIVERED.CORE counter value is about equal to UOPS_ISSUED.ANY, which is the total number of uops issued (T*2 where T is the number of iterations). But shouldn't the count be about (2/3)*T*4?Galantine
Actually, that's exactly what I was expecting: 2T undelivered uops. Recall that on Haswell, the decoders perform macrofusion of uops, and so the dec+jz and jmp constitute two uops for the purpose of counting deliveries from the IDQ to the RAT. Once the RS fills up to 48 branch uops, the IDQ will indeed fail to deliver 2 out of a possible 4 uops to the RAT every clock cycle, because while the RAT isn't stalled (it's got plenty of room for other stuff), the RAT cannot accept more than 48 branches in its branch buffer and it drains at 2 uops/cc.Alpenglow
@IwillnotexistIdonotexist Thanks, you're right. I thought the allocator would wait until it can allocate 4 uops, but I was wrong.Galantine
Worth mentioning that this smaller branch-order-buffer exists to enable fast recovery after branch mispredicts, allowing un-executed uops from before the mispredict to stay in the scheduler and keep executing, instead of flushing back to a known-good retirement state like earlier CPUs that didn't have a separate BOB.Highline
Shouldn't the lsd_cycles_4_uops counter be greater than zero anyway? Or by the time the LSD kicks in the BOB is already full?Slumberous
@MargaretBloom See the edit to the answer regarding your question.Galantine
Very interesting! Thank you very much!Slumberous
