I was investigating the capabilities of the branch unit on port 0 of my Haswell, starting with a very simple loop:
BITS 64
GLOBAL _start
SECTION .text
_start:
    mov ecx, 10000000       ; loop trip count
.loop:
    dec ecx                 ;|
    jz .end                 ;| 1 uOP (call it D)
    jmp .loop               ;| 1 uOP (call it J)
.end:
    mov eax, 60             ; exit syscall number on x86-64 Linux
    xor edi, edi            ; exit status 0
    syscall
Using perf, we see that the loop runs at 1 c/iter:
Performance counter stats for './main' (50 runs):
10,001,055 uops_executed_port_port_6 ( +- 0.00% )
9,999,973 uops_executed_port_port_0 ( +- 0.00% )
10,015,414 cycles:u ( +- 0.02% )
23 resource_stalls_rs ( +- 64.05% )
My interpretations of these results are:
- Both D and J are dispatched in parallel.
- J has a reciprocal throughput of 1 cycle.
- Both D and J are dispatched optimally.
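As a quick sanity check of these interpretations, the counters can be normalized by the trip count. This is only a sketch in Python, not part of the measurements; the values are copied from the perf output above, and the names `ITERATIONS`, `counters` and `per_iter` are my own.

```python
# Normalize the perf counters by the loop trip count (mov ecx, 10000000).
ITERATIONS = 10_000_000

# Values copied from the perf output above.
counters = {
    "uops_executed_port_port_6": 10_001_055,
    "uops_executed_port_port_0": 9_999_973,
    "cycles:u": 10_015_414,
}

# Events per loop iteration.
per_iter = {name: value / ITERATIONS for name, value in counters.items()}

for name, value in per_iter.items():
    print(f"{name:28s} {value:.4f} per iteration")
```

Each of the three values comes out very close to 1, which is what I would expect if one uOP goes to port 0 and one to port 6 in every cycle.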
However, we can also see that the RS never gets full.
For this loop it can dispatch at most 2 uOPs/c, but it can theoretically receive 4 uOPs/c from the front end, which would fill it in about 30 c (for an RS with a size of 60 fused-domain entries).
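The 30 c figure is just the RS size divided by the net fill rate. A minimal sketch of that arithmetic, assuming the 60-entry RS and the 4 uOPs/c and 2 uOPs/c rates stated above (the constant names are mine):

```python
RS_ENTRIES = 60      # fused-domain RS size assumed in the text
ISSUE_RATE = 4       # uOPs/c the front end could theoretically deliver
DISPATCH_RATE = 2    # uOPs/c the RS dispatches for this loop (D and J)

# Entries gained per cycle if the front end ran at full width.
net_fill_rate = ISSUE_RATE - DISPATCH_RATE

cycles_to_fill = RS_ENTRIES / net_fill_rate
print(f"RS would be full after about {cycles_to_fill:.0f} cycles")
```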
To my understanding, there should be very few branch mispredictions and the uOPs should all come from the LSD.
So I looked at the FE:
8,239,091 lsd_cycles_active ( +- 3.10% )
989,320 idq_dsb_cycles ( +- 23.47% )
2,534,972 idq_mite_cycles ( +- 15.43% )
4,929 idq_ms_uops ( +- 8.30% )
0.007429733 seconds time elapsed ( +- 1.79% )
which confirms that the FE is issuing from the LSD [1].
However, the LSD never issues 4 uOPs/c:
7,591,866 lsd_cycles_active ( +- 3.17% )
0 lsd_cycles_4_uops
My interpretation is that the LSD cannot issue uOPs from the next iteration [2], thereby sending only D J pairs to the BE each cycle.
Is my interpretation correct?
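To make the hypothesis concrete, here is a toy model of it in Python. It is purely an assumption of mine about the behavior, not documented hardware behavior: the function `issue_groups` and the rule that an issue group never spans two iterations are hypothetical.

```python
ISSUE_WIDTH = 4  # fused-domain uOPs the front end can issue per cycle


def issue_groups(body_uops, iterations):
    """Return the uOPs issued in each cycle, assuming (hypothetically)
    that an issue group may not contain uOPs from two iterations."""
    groups = []
    for _ in range(iterations):
        remaining = body_uops
        while remaining > 0:
            issued = min(ISSUE_WIDTH, remaining)
            groups.append(issued)
            remaining -= issued
    return groups


# The loop body is 2 fused-domain uOPs: D (dec+jz) and J (jmp).
groups = issue_groups(body_uops=2, iterations=1000)
cycles_4_uops = sum(1 for g in groups if g == ISSUE_WIDTH)
print(f"cycles: {len(groups)}, cycles issuing 4 uOPs: {cycles_4_uops}")
```

Under this model the loop takes one cycle per iteration and never issues 4 uOPs in a cycle, which matches both the 1 c/iter figure and the zero count for lsd_cycles_4_uops. That only shows the hypothesis is consistent with the counters, not that it is the real mechanism.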
The source code is in this repository.
[1] There is a bit of variance; I think this is due to the high number of iterations, which allows for some context switches.
[2] This sounds quite complex to do in hardware with limited circuit depth.