What is the stack engine in the Sandybridge microarchitecture?
Q

20

I am reading http://www.realworldtech.com/sandy-bridge/ and I'm having trouble understanding a couple of points:

The dedicated stack pointer tracker is also present in Sandy Bridge and renames the stack pointer, eliminating serial dependencies and removing a number of uops.

What is a dedicated stack pointer tracker actually?

For Sandy Bridge (and the P4), Intel still uses the term ROB. But it is critical to understand that, in this context, it only refers to the status array for in-flight uops.

What does that actually mean? Please clarify.

Quickfreeze answered 14/4, 2016 at 18:50 Comment(0)
A
22
  1. As Agner Fog's microarch doc explains, the stack engine handles the rsp+=8 / rsp-=8 part of push/pop / call/ret in the issue/rename stage of the pipeline (before issuing uops into the Out-of-Order (OoO) part of the core, the back-end).

    So the back-end only has to handle the load/store part, with an address generated by the stack engine. It occasionally has to insert a uop to sync its offset from rsp, either when the 8-bit displacement counter overflows or when the OoO back-end needs the value of rsp directly: for example, a sub rsp, 8 or a mov [rsp-8], eax after a call, ret, push or pop typically causes an extra uop to be inserted on Intel CPUs (AMD CPUs apparently don't need extra sync uops). See the sketch after this point.

    Note that Agner's instruction tables show that Pentium-M and later decode pop reg to a single uop which runs only on the load port. But Pentium II/III decodes pop eax to 2 uops: 1 ALU and 1 load, because there's no stack engine to handle the ESP adjustment outside of the back-end. Besides taking extra uops, a long chain of push/pop and call/ret creates a serial dependency on ESP, so out-of-order execution has to chew through the ALU uops before a value is available for a mov ebp, esp, or an address for mov eax, [esp+16].
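
    For illustration, here's a minimal NASM-style sketch (the function name and the constant 24 are made up) of where the stack engine does its work and where a stack-sync uop typically gets inserted on Intel CPUs:

        ; hypothetical leaf function, NASM syntax
        section .text
        global stack_engine_demo
        stack_engine_demo:
            push    rbx             ; rsp -= 8 is tracked by the stack engine at
                                    ; rename; only the store reaches the back-end
            sub     rsp, 24         ; explicit rsp reference: a stack-sync uop is
                                    ; typically inserted first on Intel CPUs so the
                                    ; back-end sees the up-to-date rsp value
            mov     qword [rsp], 0  ; no extra uop needed here: the stack engine's
                                    ; offset was just zeroed by the sub above
            add     rsp, 24         ; explicit rsp again, but the offset is still zero
            pop     rbx             ; single load uop; rsp += 8 handled by the stack
                                    ; engine (this was 2 uops back on Pentium II/III)
            ret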


  2. The P6 microarch family (PPro to Nehalem) stores the input values for a uop directly in the ROB. At issue/rename, "cold" register inputs are read from the architectural register file into the ROB (which can be a bottleneck, due to limited read ports; see register-read stalls). After a uop executes, its result is written into the ROB for other uops to read. The architectural register file is updated with values from the ROB when uops retire.

SnB-family microarchitectures (and P4) have a physical register file, so the ROB stores register numbers (i.e. a level of indirection) instead of the data directly. It still tracks in-flight uops in program order for in-order retirement, so Re-Order Buffer is still an excellent name for that part of the CPU.

Note that SnB introduced AVX, with 256-bit vectors. Making every ROB entry big enough to store double-width vectors was presumably undesirable compared to keeping vector values only in a smaller, dedicated FP register file.

SnB simplified the uop format to save power. This did lead to a sacrifice in uop micro-fusion capability, though: the decoders and uop-cache can still micro-fuse memory operands using 2-register (indexed) addressing modes, but they're "unlaminated" before issuing into the OoO back-end.
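
A minimal NASM-style sketch of that distinction (the register choices and trip count are arbitrary):

    ; hypothetical loop, NASM syntax; rdi = base of an integer array
    section .text
    global unlamination_demo
    unlamination_demo:
        xor     eax, eax
        xor     ecx, ecx
    .loop:
        add     eax, [rdi + 8]        ; one-register addressing mode: the load stays
                                      ; micro-fused with the add through the back-end
        add     eax, [rdi + rcx*4]    ; indexed addressing mode: micro-fused in the
                                      ; decoders/uop cache, but on SnB it's unlaminated
                                      ; into 2 uops before issue into the OoO back-end
        inc     rcx
        cmp     rcx, 16
        jb      .loop
        ret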

Ana answered 14/4, 2016 at 20:40 Comment(2)
@Gilgamesz: out-of-order CPU core. (Huh, google doesn't give that for "ooo core", only for "ooo cpu".) A uop spends the part of its lifetime between the "issue/rename" and "retirement" stages in the OoO core, where uops are tracked in the ROB. See realworldtech.com/haswell-cpu (and his earlier SnB writeup) for diagrams.Ana
As soon as I start reading the first line of an answer I can tell when it's written by Peter Cordes, just brilliant insight.Pebbly
T
0

The stack machine acts like a small, dedicated execution resource for stack-pointer updates, keeping them off the normal execution/memory ports. As Agner Fog says:

The modification of the stack pointer by PUSH, POP, CALL and RET instructions is done by a special stack engine. ... This relieves the pipeline from the burden of μops that modify the stack pointer.

So that takes care of the rsp+=8 / rsp-=8 arithmetic: those updates are handled by the stack machine without competing for execution-port resources. But there's more.

The 16-deep hardware return-address stack (Section 3.4.1.4 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual) is a fast shadow copy of return addresses. It showed up in Pentium M, and it is also used for return prediction. Search Fog's microarchitecture doc for "return stack buffer" for a little, but not a lot, more.

So now you have some nice HW to reduce execution-port contention for stack arithmetic, and a fast cache of return-address values. You can make the stack machine's life difficult by trying to outsmart it; basically, always match calls with rets and pushes with pops, and then you're good to go.
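
A minimal NASM-style sketch of the "keep things matched" advice (function names are made up):

    ; hypothetical caller/callee pair, NASM syntax
    section .text
    global caller_demo
    caller_demo:
        push    rbx              ; matched by the pop below
        call    helper           ; pushes a return address the predictor remembers
        pop     rbx
        ret                      ; pairs with whatever call brought us here
    helper:
        ret                      ; pairs with the call above, so the 16-deep return
                                 ; stack predicts this return correctly

    ; Popping the return address and jumping to it by hand, or otherwise leaving a
    ; call without its matching ret, unbalances that predictor and hurts later,
    ; unrelated returns.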

Telethermometer answered 12/9, 2016 at 21:28 Comment(4)
Using pop after push doesn't matter if you used mov rbp, rsp, or [rsp+8] for a local, or anything like that, between the push and pop. Any explicit use of the stack pointer forces the stack engine to insert an extra uop to update the OoO core's value. It's more like: after a call (which should return with a ret), it will be cheaper to pop once than to add rsp, 8, as well as being smaller code-size.Ana
You're lumping the return-address predictor together with the stack engine, and that's more confusing than helpful, IMO. They're orthogonal to each other; each could exist without the other, and you can observe their effects independently. Real code breaks the stack engine all the time, but using add esp, 16 instead of 4 pop instructions, or with push rbx / sub rsp, 128 to reserve stack space near the start of a function after saving a register to be restored later. Minimizing the extra uops it has to insert is useful, but not very important. But don't break call/ret pairing!Ana
@PeterCordes "but using" I can't parse that sentence. By using?Discourtesy
Oh, I think I left out the end of the sentence: but using add/sub instead of just push/pop is worth it to save total uops for larger changes to E/RSP, even though it results in a stack-sync uop for explicit (not implicit) access to E/RSP in the back-end.Ana
