Is there a penalty when base+offset is in a different page than the base?

Asked 16/9, 2018 at 6:1 Answered 16/9, 2018 at 22:10

Solved performance assembly x86 micro-optimization

The execution times for these three snippets:

pageboundary: dq (pageboundary + 8)
...

    mov rdx, [rel pageboundary]
.loop:
    mov rdx, [rdx - 8]
    sub ecx, 1
    jnz .loop

And this:

pageboundary: dq (pageboundary - 8)
...

    mov rdx, [rel pageboundary]
.loop:
    mov rdx, [rdx + 8]
    sub ecx, 1
    jnz .loop

And this:

pageboundary: dq (pageboundary - 4096)
...

    mov rdx, [rel pageboundary]
.loop:
    mov rdx, [rdx + 4096]
    sub ecx, 1
    jnz .loop

Are, on a 4770K, roughly 5 cycles per iteration for the first snippet, roughly 9 cycles per iteration for the second, and 5 cycles again for the third. All three access the exact same address, which is 4K-aligned. In the second snippet, only the address calculation crosses the page boundary: rdx and rdx + 8 don't belong to the same page, but the load itself is still aligned. With a large offset (the third snippet) it's back to 5 cycles.

How does this effect work in general?

Routing the result from the load through an ALU instruction like this:

.loop:
    mov rdx, [rdx + 8]
    or rdx, 0
    sub ecx, 1
    jnz .loop

Makes it take 6 cycles per iteration, which makes sense as 5+1. Reg+8 should be a special fast load and AFAIK take 4 cycles, so even in this case there seems to be some penalty, but only 1 cycle.

A test like this was used in response to some of the comments:

.loop:
    lfence
    ; or rdx, 0
    mov rdx, [rdx + 8]
    ; or rdx, 0
    ; uncomment one of the ORs
    lfence
    sub ecx, 1
    jnz .loop

Putting the or before the mov makes the loop faster than without any or, putting the or after the mov makes it a cycle slower.

Kienan answered 16/9, 2018 at 6:1 Comment(7)

That's weird. I don't think Intel's docs mention this failure for SnB-family's [base + 0..2047] special case 4-cycle load-use latency, but it's plausible that it's based on using the base reg to start a TLB check before an add, and is slower if it turns out they're in different pages. (And BTW, that special case is only when forwarding to another addressing mode, not to an ALU instruction.) – Nila 16/9, 2018 at 6:53

Yes inserting an ALU instruction into the dep chain decreases the total latency, which is pretty funny (like a negative-latency instruction) – Kienan 16/9, 2018 at 6:55

Feeding an ALU instruction always disables the 4-cycle pointer-chasing fast path. You'd get 6 cycles from that loop even without any page-crossing shenanigans, including with mov rdx, [rdx] / and rdx,rdx. – Nila 16/9, 2018 at 8:3

This is a really good find. I've added this effect to the Intel Performance Quirks page with links to the question and @PeterCordes' answer. – Hobbema 17/9, 2018 at 21:11

I tested this on Ryzen and didn't see any similar effect: the loop still executes at 4 cycles with the loads on different pages. Ryzen also doesn't have the restriction of the load address needing to come from a load: with a 1 cycle ALU added, the total latency goes up to 5 cycles (4 + 1), versus 6 cycles on Intel (since the load takes 5 cycles itself in that case). – Hobbema 17/9, 2018 at 23:4

I also tested this on KNL and SKX. The SKX results are exactly the same as SKL. KNL is very different: all loads are apparently 4 cycles, even with complex addressing, intervening ALU ops, etc. – Hobbema 18/9, 2018 at 16:52

@BeeOnRope: KNL's max clock frequency is lower, and its L1dTLB might be smaller (64 uTLB entries), so it makes some sense they need fewer pipeline stages in the load unit. – Nila 19/9, 2018 at 6:54

Optimization rule: in pointer-connected data structures like linked-lists / trees, put the next or left/right pointers in the first 16 bytes of the object. malloc typically returns 16-byte aligned blocks (alignof(max_align_t)), so this will ensure the linking pointers are in the same page as the start of the object.

Any other way of ensuring that important struct members are in the same page as the start of the object will also work.
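A minimal C sketch of that layout rule, assuming 4 KiB pages and an allocator that returns 16-byte-aligned blocks. The names struct node and same_page are hypothetical, made up for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list node: the link pointer is the first member, so
 * node->next is loaded with a plain [reg] addressing mode. */
struct node {
    struct node *next;     /* offset 0, inside the first 16 bytes */
    char payload[4000];    /* bulky data goes after the link */
};

/* Compile-time check of the rule (C11 _Static_assert). */
_Static_assert(offsetof(struct node, next) + sizeof(struct node *) <= 16,
               "link pointer should sit in the first 16 bytes");

/* Returns 1 if base and base+disp are in the same 4 KiB page. */
int same_page(uintptr_t base, size_t disp)
{
    return (base >> 12) == ((base + disp) >> 12);
}
```

For a 16-byte-aligned object that starts in the last 16 bytes of a page (say at 0x1FF0), a member at offset 8 is still in the same page as the object's start, while a member at offset 16 is not.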

Sandybridge-family normally has 5 cycle L1d load-use latency, but there's a special case for pointer-chasing with small positive displacements with base+disp addressing modes.

Sandybridge-family has 4 cycle load-use latency for [reg + 0..2047] addressing modes when the base reg is the result of a mov load, not an ALU instruction, but pays a penalty if reg+disp is in a different page than reg.

Based on these test results on Haswell and Skylake (and probably original SnB but we don't know), it appears that all of the following conditions must be true:

base reg comes from another load. (A rough heuristic for pointer-chasing, and usually means that load latency is probably part of a dep chain). If objects are usually allocated not crossing a page boundary, then this is a good heuristic. (The HW can apparently detect which execution unit the input is being forwarded from.)
Addressing mode is [reg] or [reg+disp8/disp32]. (Or an indexed load with an xor-zeroed index register! Usually not practically useful, but might provide some insight into the issue/rename stage transforming load uops.)
displacement < 2048, i.e. bit 11 and all higher bits are zero (a condition HW can check without a full integer adder/comparator).
(Skylake but not Haswell/Broadwell): the last load wasn't a retried-fastpath. (So if base = result of a 4 or 5 cycle load, it will attempt the fast path. But if base = result of a 10 cycle retried load, it won't. The penalty on SKL seems to be 10, vs. 9 on HSW.)

I don't know if it's the last load attempted on that load port that matters, or if it's actually what happened to the load that produced that input. Perhaps experiments chasing two dep chains in parallel could shed some light; I've only tried one pointer chasing dep chain with a mix of page-changing and non-page-changing displacements.

If all those things are true, the load port speculates that the final effective address will be in the same page as the base register. This is a useful optimization in real cases when load-use latency forms a loop-carried dep chain, like for a linked list or binary tree.

microarchitectural explanation (my best guess at explaining the result, not from anything Intel published):

It seems that indexing the L1dTLB is on the critical path for L1d load latency. Starting that 1 cycle early (without waiting for the output of an adder to calculate the final address) shaves a cycle off the full process of indexing L1d using the low 12 bits of the address, then comparing the 8 tags in that set against the high bits of the physical address produced by the TLB. (Intel's L1d is VIPT 8-way 32kiB, so it has no aliasing problems because the index bits all come from the low 12 bits of the address: the offset within a page which is the same in both the virtual and physical address. i.e. the low 12 bits translate for free from virt to phys.)
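The "translates for free" point is plain arithmetic on the cache geometry. A small sketch, assuming 64-byte cache lines (the 32 KiB size and 8 ways come from the text; the constant and function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

enum {
    L1D_BYTES  = 32 * 1024,  /* 32 KiB, from the text */
    L1D_WAYS   = 8,          /* 8-way, from the text */
    LINE_BYTES = 64,         /* assumption: 64-byte cache lines */
    PAGE_BYTES = 4096        /* 4 KiB pages */
};

/* Number of sets: 32768 / (8 * 64) = 64. */
unsigned l1d_sets(void)
{
    return L1D_BYTES / (L1D_WAYS * LINE_BYTES);
}

/* Set index: address bits [11:6], all inside the page offset. */
unsigned l1d_set_index(uint64_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % l1d_sets());
}
```

With 64 sets of 64-byte lines, set index plus line offset cover exactly 4096 bytes, so two addresses with the same low 12 bits (a virtual address and its physical translation, for instance) always select the same set.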

Since we don't find an effect for crossing 64-byte boundaries, we know the load port is adding the displacement before indexing the cache.

As Hadi suggests, it seems likely that if there's carry-out from bit 11, the load port lets the wrong-TLB load complete and then redoes it using the normal path. (On HSW, the total load latency = 9. On SKL the total load latency can be 7.5 or 10).
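Both checks mentioned so far are cheap at the bit level. The following is only a software illustration of that idea, not a description of the actual circuit; function names are hypothetical and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when 0 <= disp <= 2047: bit 11 and all higher bits of the
 * displacement are zero, so no comparator is needed. */
bool disp_is_small(int32_t disp)
{
    return ((uint32_t)disp & ~UINT32_C(0x7FF)) == 0;
}

/* For a small displacement, the page number changes exactly when adding
 * disp to the low 12 bits of base carries out of bit 11. */
bool carry_out_of_bit11(uint64_t base, uint32_t disp)
{
    return (((base & 0xFFF) + disp) >> 12) != 0;
}

/* Reference check using a full-width add and compare. */
bool crosses_page(uint64_t base, uint32_t disp)
{
    return ((base + disp) >> 12) != (base >> 12);
}
```

For every displacement in 0..2047 the carry test agrees with the full comparison, which is why a mis-speculation can in principle be detected from the add the load port has to do anyway.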

Aborting right away and retrying on the next cycle (to make it 5 or 6 cycles instead of 9) would in theory be possible, but remember that the load ports are pipelined with 1 per clock throughput. The scheduler is expecting to be able to send another uop to the load port in the next cycle, and Sandybridge-family standardizes latencies for everything of 5 cycles and shorter. (There are no 2-cycle instructions).

I didn't test if 2M hugepages help, but probably not. I think the TLB hardware is simple enough that it couldn't recognize that a 1-page-higher index would still pick the same entry. So it probably does the slow retry any time the displacement crosses a 4k boundary, even if that's in the same hugepage. (Page-split loads work this way: if the data actually crosses a 4k boundary (e.g. an 8-byte load starting 4 bytes before the end of a page), you pay the page-split penalty, not just the cache-line split penalty, regardless of hugepages.)

Intel's optimization manual documents this special case in section 2.4.5.2 L1 DCache (in the Sandybridge section), but doesn't mention any different-page limitation, or the fact that it's only for pointer-chasing, and doesn't happen when there's an ALU instruction in the dep chain.

 (Sandybridge)
Table 2-21. Effect of Addressing Modes on Load Latency
-----------------------------------------------------------------------
Data Type             |  Base + Offset > 2048    | Base + Offset < 2048
                      |  Base + Index [+ Offset] |
----------------------+--------------------------+----------------------
Integer               |            5             |  4
MMX, SSE, 128-bit AVX |            6             |  5
X87                   |            7             |  6
256-bit AVX           |            7             |  7
 (remember, 256-bit loads on SnB take 2 cycles in the load port, unlike on HSW/SKL)

The text around this table also doesn't mention the limitations that exist on Haswell/Skylake, and may also exist on SnB (I don't know).

Maybe Sandybridge doesn't have those limitations and Intel didn't document the Haswell regression, or else Intel just didn't document the limitations in the first place. The table is pretty definite about that addressing mode always being 4c latency with offset = 0..2047.

@Harold's experiment of putting an ALU instruction as part of the load/use pointer-chasing dependency chain confirms that it's this effect that's causing the slowdown: an ALU insn decreased the total latency, effectively giving an instruction like and rdx, rdx negative incremental latency when added to the mov rdx, [rdx+8] dep chain in this specific page-crossing case.

Previous guesses in this answer included the suggestion that using the load result in an ALU vs. another load was what determined the latency. That would be super weird and require looking into the future. That was a wrong interpretation on my part of the effect of adding an ALU instruction into the loop. (I hadn't known about the 9-cycle effect on page crossing, and was thinking that the HW mechanism was a forwarding fast-path for the result inside the load port. That would make sense.)

We can prove that it's the source of the base reg input that matters, not the destination of the load result: Store the same address at 2 separate locations, before and after a page boundary. Create a dep chain of ALU => load => load, and check that it's the 2nd load that's vulnerable to this slowdown / able to benefit from the speedup with a simple addressing mode.

%define off  16
    lea    rdi, [buf+4096 - 16]
    mov    [rdi], rdi
    mov    [rdi+off], rdi

    mov     ebp, 100000000
.loop:

    and    rdi, rdi
    mov    rdi, [rdi]        ; base comes from AND
    mov    rdi, [rdi+off]    ; base comes from a load

    dec   ebp
    jnz  .loop

    ... sys_exit_group(0)

section .bss
align 4096
buf:    resb 4096*2

Timed with Linux perf on SKL i7-6700k.

off = 8, the speculation is correct and we get total latency = 10 cycles = 1 + 5 + 4. (10 cycles per iteration).
off = 16, the [rdi+off] load is slow, and we get 16 cycles / iter = 1 + 5 + 10. (The penalty seems to be higher on SKL than HSW)

With the load order reversed (doing the [rdi+off] load first), it's always 10c regardless of off=8 or off=16, so we've proved that mov rdi, [rdi+off] doesn't attempt the speculative fast-path if its input is from an ALU instruction.

Without the and, and off=8, we get the expected 8c per iter: both use the fast path. (@harold confirms HSW also gets 8 here).

Without the and, and off=16, we get 15c per iter: 5+10. The mov rdi, [rdi+16] attempts the fast path and fails, taking 10c. Then mov rdi, [rdi] doesn't attempt the fast-path because its input failed. (@harold's HSW takes 13 here: 4 + 9. So that confirms HSW does attempt the fast-path even if the last fast-path failed, and that the fast-path fail penalty really is only 9 on HSW vs. 10 on SKL)

It's unfortunate that SKL doesn't realize that [base] with no displacement can always safely use the fast path.

On SKL, with just mov rdi, [rdi+16] in the loop, the average latency is 7.5 cycles. Based on tests with other mixes, I think it alternates between 5c and 10c: after a 5c load that didn't attempt the fast path, the next one does attempt it and fails, taking 10c. That makes the next load use the safe 5c path.
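The measurements above are all consistent with a small set of rules, which can be written down as a toy latency model. This is a sketch under stated assumptions, not a simulator of the real hardware: it assumes L1d hits, a single dependency chain whose pointer value is always the same base address (as in the self-pointing test), and it treats "the previous load was a retried fast-path" as a property of the producing load. Indexed addressing modes are not modelled. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum uarch { HSW, SKL };

/* One instruction in the dependency chain: a load [reg+disp], or a
 * 1-cycle ALU op when is_load is false (disp is then ignored). */
struct op { bool is_load; int32_t disp; };

/* Average cycles per iteration for a loop-carried chain of ops whose
 * pointer value is always `base`. */
double cycles_per_iter(enum uarch u, uint64_t base,
                       const struct op *ops, size_t n)
{
    const int warmup = 2, timed = 1000;      /* timed is even on purpose */
    const int retry = (u == SKL) ? 10 : 9;   /* failed fast-path latency */
    bool from_load = false, prev_retried = false;
    long total = 0;

    for (int it = 0; it < warmup + timed; it++) {
        for (size_t k = 0; k < n; k++) {
            int lat = 1;                     /* ALU op */
            if (ops[k].is_load) {
                int32_t d = ops[k].disp;
                /* Fast path is attempted only for a load-fed base and a
                 * displacement in 0..2047; SKL also skips it after a retry. */
                bool attempt = from_load && d >= 0 && d < 2048 &&
                               !(u == SKL && prev_retried);
                bool crosses = ((base + (uint64_t)(int64_t)d) >> 12)
                               != (base >> 12);
                lat = !attempt ? 5 : (crosses ? retry : 4);
                prev_retried = attempt && crosses;
            }
            from_load = ops[k].is_load;
            if (it >= warmup)
                total += lat;
        }
    }
    return (double)total / timed;
}
```

With base set 16 bytes below a page boundary, this model gives back the 10, 16, 8, 15 and 7.5 cycle results reported for SKL and the 13 cycle result reported for HSW. That only shows the rules are self-consistent with these tests, not that the hardware works this way.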

Adding a zeroed index register actually speeds it up in this case where we know the fast-path is always going to fail. Or using no base register, like [nosplit off + rdi*1], which NASM assembles to 48 8b 3c 3d 10 00 00 00 mov rdi,QWORD PTR [rdi*1+0x10]. Notice that this requires a disp32, so it's bad for code size.

Also beware that indexed addressing modes for micro-fused memory operands are un-laminated in some cases, while base+disp modes aren't. But if you're using pure loads (like mov or vbroadcastss), there's nothing inherently wrong with an indexed addressing mode. Using an extra zeroed register isn't great, though.

On Ice Lake, this special 4 cycle fast path for pointer chasing loads is gone: GP register loads that hit in L1 now generally take 5 cycles, with no difference based on the presence of indexing or the size of the offset.

Nila answered 16/9, 2018 at 7:15 Comment(23)

Sandy Bridge actually has a performance event, AGU_BYPASS_CANCEL.COUNT whose name and description pretty much explains the effect: This event counts executed load operations with all the following traits: 1. addressing of the format [base + offset], 2. the offset is between 1 and 2047, 3. the address specified in the base register is in one page and the address [base+offset] is in an. (yes, it ends abruptly like that). The "between 1" part seems wrong since as you point out it happens even for zero offsets. – Hobbema 13/10, 2018 at 3:41

I think I found the Intel patent that describes this particular optimization. It's pretty old. It says: "The invention has the advantages that it improves the "base-plus-displacement/offset" and the "scaled-index-plus-displacement" addressing modes by one clock most of the time." We have verified this for base+disp but not the latter. Also it's not clear to me how the terms disp and offset are used in the patent. – Venepuncture 26/11, 2018 at 2:1

On Sandy Bridge, the fast path is 4c, the slow path is 5c, and the mispredict path is 9c, just like Haswell. On Ivy Bridge, the fast path is 4c, the slow path is around 4.6c, and the mispredict path is around 8.7c. On both, indexed addressing mode does not use the 4c path even if the index register is zeroed using a zeroing idiom. I don't know how to interpret the IvB numbers. On both, the AGU_BYPASS_CANCEL performance event can be used to count fast path mispredicts. This counter does not work on Haswell. – Venepuncture 14/1, 2019 at 20:34

BTW, I have now temporary access to IvB and SnB (and hopefully soon Westmere). So if we need to run any experiments on them, let me know. – Venepuncture 14/1, 2019 at 20:35

@PeterCordes will the 4c optimized load be interrupted by an interleaved store? i.e (at&t) movl (reg0), reg0; movl reg1, (reg1); movl (reg0), reg0. Wondering if memory disambiguation will prevent this optimization. (Only have ICL so can't test). – Scallop 18/2, 2021 at 20:4

@Noah: I'd assume no interaction with the 4c special case. You'd either get a 4c load from L1d cache on no overlap, store-forwarding (presumably also 4c from load-address to data, and the usual 3-5c variable from store-data to load-data) on full overlap, a store-forwarding stall on partial overlap, or a machine_clear.memory_ordering if the hardware guesses wrong about whether to forward or not. (Mispredicts the memory dependency). I don't expect you could provoke a 5c load latency; having an outstanding unknown-address store is probably a fairly normal situation over a 224 uop ROB. – Nila 18/2, 2021 at 21:33

@PeterCordes ran uarch-bench on ICL and see a few relevant things regarding the numbers in this post: Simple addressing pointer chase 3.30 2.54, Simple addressing chase, half diffpage 3.35 2.58, Simple addressing chase, different pages 3.38 2.60, Simple addressing chase with ALU op 4.02 3.10. Seems 1) that the time has been reduced from 4c to closer to 3.3-3.4c. 2) Alu ops are true 4c. 3) the [0..2047] bound has expanded. – Scallop 2/3, 2021 at 21:35

@Noah: the [0..2047] bound has expanded - is there any evidence of there still being a special case for small offsets at all? Otherwise it sounds like they just made simple base+offset addressing modes faster overall. Are other addressing modes, like [rdi + rsi + 16] different? Also, the 1st number is core cycles, 2nd number is ref cycles, right? So we only care about the first number. – Nila 3/3, 2021 at 2:54

@PeterCordes - the first column is (true) cycles, the second column is nanoseconds. – Hobbema 3/3, 2021 at 3:50

@Scallop - the results are "too good" for some of those results: the minimum load latency is 5 cycles on ICL, even with simple addressing, barring "memory renaming". Probably what is happening is that memory renaming is kicking in and at least part of the test runs by loading the value from the register file rather than actually doing a load. I'll try to adjust it to defeat memory renaming. – Hobbema 3/3, 2021 at 4:0

@BeeOnRope: Oh, does ICL borrow that zero-latency store forwarding trick from Zen 2? (when the same addressing mode with the same register is used) agner.org/forum/viewtopic.php?t=41 – Nila 3/3, 2021 at 4:5

I doubt either borrows from the other as they were released at almost exactly the same time, but yes Ice Lake has memory renaming. – Hobbema 3/3, 2021 at 4:9

After this change memory renaming is defeated and the results look much more sane on Ice Lake. @Scallop – Hobbema 3/3, 2021 at 4:14

@BeeOnRope: IDK how they coordinate their patent-sharing, whether sharing is plausible here or whether it's more likely independent invention. There's enough lead-time in CPU designs for it to be plausible to incorporate late an idea the other company had in the early stages of their design, at least if the idea was disclosed early enough via a patent. – Nila 3/3, 2021 at 4:14

I bet it is independent invention. I don't think they are sharing their patents ahead of time like that so their designs are similar. Memory renaming has been talked about for years in academia and elsewhere. I don't know if these are even the first chips to do it. The Ice Lake implementation looks quite different: not based on "addressing expression matching" like Zen 2. – Hobbema 3/3, 2021 at 4:16

So I should add that on Ice Lake the 4-cycle opt is gone: most loads of GPs regs (barring things like cross cache line, segment prefix etc) take 5 cycles. So the test results no longer show any penalty for loads that fall in another page after the offset is added. – Hobbema 3/3, 2021 at 5:14

@Hobbema I'm running the following test on ICL (at&t):

    movl $100000, %eax
1:
    movl %esi, (%rdi)
    addl $5, (%rdi)
    movl (%rdi), %ecx
    orl %ecx, %esi
    decl %eax
    jnz 1b

which I think should demonstrate "memory renaming" if it exists but see 15 cycle / iteration 80% of the time (12 the other 20%). If ICL had memory renaming wouldn't I expect 3 cycles / iteration? – Scallop 3/3, 2021 at 17:6

@Scallop - I'm not sure why that example doesn't run faster on ICL, but there is memory renaming on that uarch as shown by other examples. – Hobbema 6/3, 2021 at 20:47

@Hobbema do you know if memory aliasing can occur between load/store of GPR and vector register? Aka movl %eax, (%rdi), vmovd (%rdi), %xmm0 or vice versa? – Scallop 17/6, 2021 at 2:4

@Scallop - yes, sure. I mean in the sense that it's the same memory so actual aliasing can by definition occur: vector loads must see GP stores that overlap and vice-versa, for correctness. Or are you asking about whether forwarding occurs? I believe it does (efficiently) for GP loads hitting vector stores. The other way around is a stall because vector loads are wider than GP stores, so you get the partial load stall. – Hobbema 17/6, 2021 at 5:42

@Hobbema re: "The other way around is a stall because vector loads are wider than GP stores, so you get the partial load stall": What about stores from a vector register but of same size as the GP? (example above has vmovd, not vmovdqu) – Scallop 17/6, 2021 at 5:50

@Scallop - I don't recall specifically testing it but I suspect it should be fine but with 1 or 2 extra cycles of latency (for data). IIRC there are some vector forwarding tests in uarch-bench that could be adapted. – Hobbema 17/6, 2021 at 8:28

@Noah: GCC with some tuning options likes to use scalar store / movd reload as part of _mm_set_epi32. This generally sucks but isn't a total disaster (bit higher latency, but less ALU port pressure); if it caused a store-forwarding stall on most CPUs, it would have been fixed sooner! Or not, _mm_set_epi64x on 32-bit really does create store-forwarding stalls with 64-bit reloads of two 32-bit stores: gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 reported 4 years ago, still not fixed gcc.godbolt.org/z/PGofje4G3. (GCC's optimizer doesn't "know about" SF stalls.) – Nila 17/6, 2021 at 8:55

I've conducted a sufficient number of experiments on Haswell to determine exactly when memory loads are issued speculatively before the effective address is fully calculated. These results also confirm Peter's guess.

I've varied the following parameters:

The offset from pageboundary. The offset used is the same in the definition of pageboundary and the load instruction.
The sign of the offset is either + or -. The sign used in the definition is always the opposite of the one used in the load instruction.
The alignment of pageboundary within the executable binary.

In all of the following graphs, the Y axis represents the load latency in core cycles. The X axis represents the configuration in the form NS1S2, where N is the offset, S1 is the sign of the offset used in the definition, and S2 is the sign used in the load instruction.

The following graph shows that loads are issued before calculating the effective address only when the offset is positive or zero. Note that for all of the offsets between 0-15, the base address and the effective address used in the load instruction are both within the same 4K page.

The next graph shows the point where this pattern changes. The change occurs at offset 213, which is the smallest offset where the base address and the effective address used in the load instruction are both within different 4K pages.

Another important observation that can be made from the previous two graphs is that even if the base address points to a different cache set than the effective address, no penalty is incurred. So it seems that the cache set is opened after calculating the effective address. This indicates that the L1 DTLB hit latency is 2 cycles (that is, it takes 2 cycles for the L1D to receive the tag), but it takes only 1 cycle to open the cache's data array set and the cache's tag array set (which occurs in parallel).

The next graph shows what happens when pageboundary is aligned on a 4K page boundary. In this case, any offset that is not zero will make the base and effective addresses reside within different pages. For example, if the base address of pageboundary is 4096, then the base address of pageboundary used in the load instruction is 4096 - offset, which is obviously in a different 4K page for any non-zero offset.
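The thresholds in the last two cases follow from the same arithmetic: the base register holds pageboundary - N and the load adds N back, so the page changes as soon as N exceeds pageboundary's offset within its 4K page. A small sketch, assuming 4 KiB pages. The function names and example addresses are hypothetical, and the in-page offset of 212 is only inferred from the reported threshold of 213; it is not stated in the answer.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if base = target - n lies in a different 4 KiB page than target,
 * i.e. the load [base + n] changes page relative to its base register. */
int crosses(uint64_t target, uint32_t n)
{
    return ((target - n) >> 12) != (target >> 12);
}

/* Smallest n >= 1 for which the base register falls in the previous page. */
uint32_t smallest_crossing_offset(uint64_t target)
{
    return (uint32_t)(target & 0xFFF) + 1;
}
```

A 4K-aligned target gives a threshold of 1 (every non-zero offset crosses), and a target 212 bytes into its page gives 213.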

The next graph shows that the pattern changes again starting from offset 2048. At this point, loads are never issued before calculating the effective address.

This analysis can be confirmed by measuring the number of uops dispatched to the load ports 2 and 3. The total number of retired load uops is 1 billion (equal to the number of iterations). However, when the measured load latency is 9 cycles, the number of load uops dispatched to each of the two ports is 1 billion. Also when the load latency is 5 or 4 cycles, the number of load uops dispatched to each of the two ports is 0.5 billion. So something like this would be happening:

The load unit checks whether the offset is non-negative and smaller than 2048. In that case, it will issue a data load request using the base address. It will also begin calculating the effective address.
In the next cycle, the effective address calculation is completed. If it turns out that the load is to a different 4K page, the load unit waits until the issued load completes and then it discards the results and replays the load. Either way, it supplies the data cache with the set index and line offset.
In the next cycle, the tag comparison is performed and the data is forwarded to the load buffer. (I'm not sure whether the address-speculative load will be aborted in the case of a miss in the L1D or the DTLB.)
In the next cycle, the load buffer receives the data from the cache. If it's supposed to discard the data, it's discarded and it tells the dispatcher to replay the load with address speculation disabled for it. Otherwise, the data is written back. If a following instruction requires the data for its address calculation, it will receive the data in the next cycle (so it will be dispatched in the next cycle if all of its other operands are ready).

These steps explain the observed 4, 5, and 9 cycle latencies.
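Read as arithmetic, the steps give 4 cycles when the speculation succeeds, 5 when it isn't attempted, and 4 + 5 = 9 when a speculative load is thrown away and replayed on the normal path. The replay also accounts for the doubled port counts. A minimal sketch for Haswell only, assuming the two load ports share the load uops evenly as the measured counts suggest; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { FAST_PATH = 4, NORMAL_PATH = 5 };

struct load_cost { int latency; int dispatches; };

/* Latency and number of load-port dispatches for one retired load. */
struct load_cost hsw_load_cost(bool speculated, bool crossed_page)
{
    struct load_cost c;
    if (!speculated) {
        c.latency = NORMAL_PATH;
        c.dispatches = 1;
    } else if (!crossed_page) {
        c.latency = FAST_PATH;
        c.dispatches = 1;
    } else {
        /* Wasted speculative attempt, then a replay on the normal path. */
        c.latency = FAST_PATH + NORMAL_PATH;
        c.dispatches = 2;
    }
    return c;
}

/* Load uops seen by each of the two load ports. */
uint64_t uops_per_port(uint64_t retired_loads, struct load_cost c)
{
    return retired_loads * (uint64_t)c.dispatches / 2;
}
```

For 1 billion retired loads this gives 1 billion uops per port in the 9-cycle case and 0.5 billion per port otherwise, matching the counts above.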

It might happen that the target page is a hugepage. The only way for the load unit to know whether the base address and the effective address point to the same page when using hugepages is to have the TLB supply the load unit with the size of the page being accessed. Then the load unit has to check whether the effective address is within that page. In modern processors, on a TLB miss, dedicated page-walk hardware is used. In this case, I think that the load unit will not supply the cache set index and cache line offset to the data cache and will use the actual effective address to access the TLB. This requires enabling the page-walk hardware to distinguish between loads with speculative addresses and other loads. Only if that other access missed the TLB will the page walk take place. Now if the target page turned out to be a hugepage and it's a hit in the TLB, it might be possible to inform the load unit that the size of the page is larger than 4K or maybe even of the exact size of the page. The load unit can then make a better decision regarding whether the load should be replayed. However, this logic should take no more than the time for the (potentially wrong) data to reach the load buffer allocated for the load. I think this time is only one cycle.

Venepuncture answered 16/9, 2018 at 22:10 Comment(16)

The next sentence in Intel's manual after "can be" is "However, overall latency varies depending on the target register data type due to stack bypass". This very much gives the impression they only said can because it only applies to GP integer. The table does explicitly say that GP integer loads with that addressing mode are 4 cycles, not 4 or 9 cycles. I don't think Intel's weasel words were sufficient to make their manual not wrong for HSW. I'm curious whether we still have the same effect on first-gen SnB, which is what's being documented in that part of the manual. – Nila 16/9, 2018 at 23:4

@PeterCordes I don't have an SnB to run experiments on. But I have a Broadwell, so I might conduct the same experiments there too and see if there is any difference from Haswell. Unlikely though. – Venepuncture 16/9, 2018 at 23:7

It's unlikely BDW is different. There aren't many changes from HSW. I wish I still had access to a SnB machine to test this and a few other things. I tried SKL and found some more interesting things (like a dependency on whether the previous load failed the fast-path). See my updated answer. – Nila 17/9, 2018 at 0:41

HW page walk is not microcoded; there is dedicated page-walk hardware that does its own cache loads separate from the load ports. What happens after a L2 TLB miss?. Fun fact: in P5 and earlier, the page-walk hardware bypassed the cache (so trapping to a software page walk was actually faster), but P6-family's page walker does cached loads. Are page table walks cached? – Nila 17/9, 2018 at 2:51

BTW, your graphs would be easier to follow if they weren't alternating positive/negative. We know from previous experiment and Intel's manuals that there's never anything weird about [base - constant], so these sawtooths are unexpected / hard-to-follow. You have to carefully read the legend to distinguish +- from -+, and I wouldn't have been able to easily follow which was which if I didn't already know that only positive displacements (negative relative offset in your terminology) could ever be 4 or 9. Especially since the titles just say 0..n, it's unexpected for that to be magnitude. – Nila 17/9, 2018 at 3:45

In your new last paragraph, I'm not sure what point you're making about TLB misses and page walks. I think you have multiple points here. 1. on TLB miss, we have to send the correct address to the page walker, not the speculative one. But mis-speculation can be detected before the first TLB check even completes, as you say in a single cycle (checking for carry-out into the page number from an add it had to do anyway). Oh, and I think you're saying that on mis-speculation it might avoid fetching the data+tags for that set of the VIPT L1d cache? Makes sense, good power optimization. – Nila 17/9, 2018 at 3:53

And 2. you're making the point that if the TLB check included page sizes, it could maybe avoid a replay on crossing a 4k boundary inside a hugepage, but I didn't follow the last sentence. – Nila 17/9, 2018 at 3:56

@PeterCordes We know from experiments that the load unit waits for the speculative-address load to complete before issuing the load with the correct address. If the TLB could supply the load unit with the page size, it can only do that in the third cycle. But in the fourth cycle, the data will reach the load buffer and it is at this point, ideally, that the load unit should either discard the results and replay the load or accept the data if the speculated address is correct. So there is only one cycle between receiving the page size and the data. The load unit should determine what to do in one cycle. – Venepuncture 17/9, 2018 at 4:3

Otherwise, the replayed load will have a higher latency. That is, the cost of an address mispredict will be higher. I'm not sure if this can be compensated by the benefit of supplying the page size to the load unit. See also https://mcmap.net/q/14363/-address-translation-with-multiple-pagesize-specific-tlbs. The TLB does not have to waste capacity on storing page sizes. – Venepuncture 17/9, 2018 at 4:4

If you want to avoid replay, maybe the TLB result could include an "also valid for next 4k" bit, and then you only have to replay if that's false and there was a carry into the page number. Then the actual page size doesn't have to leave the TLB for separate checking. I haven't taken the time to really grok your size-mask answer for multiple page sizes in the same TLB. I remembered it exists, but IDK how easy it would be for HW to produce a valid-for-next-4k result, too. Could maybe be useful for page-split loads/stores as well. – Nila 17/9, 2018 at 4:8

@PeterCordes But the 16- or 32-bit displacement can go beyond the next 4K. The total number of page sizes is currently 4 (4K, 2M, 4M, 1G), and Intel is adding a few more page sizes. So it takes about 3 bits to inform the load unit of the exact page size. Hopefully the check then can be performed within one cycle. But yes, if we are only going to support a max of 2048 offset, then your valid for next 4k bit is good enough. – Venepuncture 17/9, 2018 at 4:12

Oh, you're imagining applying this fast-path for larger offsets within a 2M or 1G hugepage. I don't see a way to make that happen without hurting load latency for many common cases. Extra TLB read ports cost power and transistors, so attempting this in more cases would cost power even if it doesn't hurt throughput as well. The main benefit is when pointer-chasing is a loop-carried dependency, and the majority of cases are with small offsets. – Nila 17/9, 2018 at 4:21

@PeterCordes Yes it depends on the type of workloads the processor is expected to run. Your "also valid for next 4k" bit is more plausible in general. – Venepuncture 17/9, 2018 at 4:25

@PeterCordes I've run the experiments on Broadwell. Got the same results as on Haswell. – Venepuncture 17/9, 2018 at 18:15

@HadiBrais if I understand correctly you are saying the 4c vs 5c optimization is made on step 1. In the 4c case it spends 1 cycle checking that 0 <= offset <= 2047 and issuing the load Vs. in the 5c case it spends 2 cycles calculating the address (assuming a dTLB hit) before issuing? So the difference is whether in 4c case step 2 is a check verifying it predicted the correct address or in 5c case the actual issuance of the load? – Scallop 18/2, 2021 at 20:30

@Scallop If the segment base address is zero, it takes one cycle to calculate the effective address in any addressing form. However, the check 0 <= disp <= 2047 can be done during decoding in the frontend before dispatch (so there may be a bit in the uop encoding to specify whether it's eligible for fast path execution). The difference of 4c vs. 5c stems from when the load is issued to the TLB and DC. If the check is true, it's issued in the first cycle. Otherwise, the effective address is calculated in the first cycle and then issued in the second cycle. Hence, a difference of one cycle. – Venepuncture 21/2, 2021 at 12:4
