Return stack buffer?

As I understand it, the Return Stack Buffer only holds 4 to 16 entries (from the wiki: http://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_function_returns) and is not a key-value structure (that is, it is not indexed by the position of the ret instruction). Is that true? What happens to the RSB when a context switch happens?

Suppose we are nested 50 function calls deep, none of which has returned yet, on a CPU whose return stack buffer has 16 entries. What happens then? Do all the return predictions fail? Can you illustrate it? Is the scenario the same for recursive function calls?

Eustatius answered 5/12, 2012 at 12:8 Comment(3)
I think the return stack buffer is reset at a context switch. There is some info about the RSB in the PDF from Agner: agner.org/optimize/microarchitecture.pdf, section 3.14. The RSB is a fixed-length LIFO buffer (last in, first out; also known as a stack); in a deep call stack, older returns are pushed out of the RSB and are not predicted. This technique will hardly help with deep recursion. PS: in the last point of section 3.1 Agner says "information that the predictors have collected is often lost due to task switches and other context switches" - Wandawander
As I understand it, the RSB is unaware of context switches: as osgx says, it's a LIFO buffer that will just be "wrong" and mispredict after a context switch, just as if a mismatched CALL or RET had been encountered. - Bounteous
The most common case for very frequent calls / returns is shallow enough for a 16-entry "stack", although newer CPUs do make it somewhat deeper. (And some will fall back to standard indirect-branch prediction if the RSB is empty.) - Bavaria

The BPU (branch prediction unit) can contain its own RAS (return address stack) predictor, which pushes the assumed NLIP of a call (the IP of the instruction following it) onto the RAS when the BTB (branch target buffer) predicts a call-type branch. The next return that the BTB predicts uses the top of the RAS as its predicted target (much as, when a regular indirect branch is predicted, a parallel hit in the ITA outranks the target address stored in the BTB).
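To make the push-on-call / pop-on-return idea concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the `BTB` dictionary layout, the addresses, and the function name `predict_next_ip` are hypothetical, and real hardware is organised very differently. It also assumes the BTB entry records the call's length, which is what lets the predictor compute the NLIP before decode.

```python
# Hypothetical BTB: maps a branch's IP to (type, instruction length, target).
# All addresses and lengths are made up for the example.
BTB = {
    0x400: ("call", 5, 0x800),  # call at 0x400, 5 bytes long, targets 0x800
    0x810: ("ret", 1, None),    # ret inside the callee
}


def predict_next_ip(ip, ras):
    """Return the predicted next fetch address for the branch at `ip`.

    `ras` is a plain Python list used as the return address stack.
    Returns None when the toy predictor has nothing to offer.
    """
    entry = BTB.get(ip)
    if entry is None:
        return None  # no BTB hit: no branch predicted at this IP
    kind, length, target = entry
    if kind == "call":
        ras.append(ip + length)  # push the NLIP (address after the call)
        return target
    if kind == "ret":
        return ras.pop() if ras else None  # empty RAS: no return prediction
    return target
```

With an empty `ras`, predicting the call at `0x400` returns `0x800` and leaves `0x405` on the stack; predicting the `ret` at `0x810` then returns `0x405`.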

The BAC (branch address calculator) verifies or overrides these return-target predictions at decode: it pushes the NLIP of every call instruction onto its own RSB, and the predicted target of the next return is compared with the address on top of that RSB. If the prediction is incorrect, the BAC issues a BAclear and resteers the next-IP logic at the start of the pipeline to the correct return address (which might itself turn out to be wrong if the RSB is corrupted). It probably also overwrites the RAS predictor stack with the BAC RSB state.

In one implementation, the BAC attaches the TOS (top-of-stack) pointer, along with the fall-through address, to every branch prediction it verifies. Once a branch executes and the real outcome is known, the RSB TOS is restored if a misprediction occurred. I think a more efficient design is to keep an architectural RSB at retirement, which is copied into the BAC RSB and the RAS predictor upon a pipeline flush or misprediction. That prevents restoring to a corrupt RSB.
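The difference between restoring only the TOS pointer and restoring the contents can be shown with a small toy model. This is a sketch under stated assumptions, not a description of any real CPU: the class `CheckpointedRSB`, the helper `replay`, and the addresses are all hypothetical.

```python
class CheckpointedRSB:
    """Toy RSB where only the TOS pointer is checkpointed (hypothetical model)."""

    def __init__(self, size=16):
        self.size = size
        self.slots = [None] * size
        self.tos = 0  # index of the next free slot

    def push(self, addr):
        self.slots[self.tos % self.size] = addr
        self.tos += 1

    def pop(self):
        self.tos -= 1
        return self.slots[self.tos % self.size]

    def checkpoint(self):
        return self.tos  # only the pointer travels with the branch

    def restore(self, saved_tos):
        self.tos = saved_tos  # slot contents are NOT restored


def replay(wrong_path_ops):
    """Push two real return addresses, checkpoint at a branch, run the
    wrong-path calls/returns, restore the TOS, then report what the next
    real return would be predicted as."""
    rsb = CheckpointedRSB()
    rsb.push(0xA0)
    rsb.push(0xB0)
    saved = rsb.checkpoint()
    for op, addr in wrong_path_ops:
        if op == "call":
            rsb.push(addr)
        else:  # "ret"
            rsb.pop()
    rsb.restore(saved)
    return rsb.pop()
```

In this model `replay([("call", 0xC0)])` still yields `0xB0`, because the wrong-path call only wrote above the checkpoint. `replay([("ret", None), ("call", 0xC0)])` yields `0xC0`: the wrong path popped and then overwrote a live slot, so restoring the pointer alone brings back a corrupt entry. Copying in a separate architectural RSB, as suggested above, would avoid that case.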

The RAS predictor is likely a circular stack, which may or may not have overflow and underflow checks and guarantees, depending on the implementation. When the stack is full, a new prediction likely overwrites the oldest one so that the stack is always up to date (rather than refusing the new entry, which would mean keeping a counter of how many calls / returns it is unable to make predictions for). On underflow, it likely declines to make a prediction and uses the ITA instead. If the RSB underflows, it probably doesn't override the prediction made by the RAS predictor.
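The question's scenario (50 nested calls on a 16-entry buffer) can be simulated with a toy circular stack that behaves as guessed above: overflow overwrites the oldest entry, and underflow gives no prediction. This is only an illustrative sketch; `CircularReturnStack`, `simulate_nested_calls`, and the return addresses are made up, and a real CPU may fall back to its indirect-branch predictor instead of giving up.

```python
class CircularReturnStack:
    """Toy fixed-size return stack (hypothetical model, not real hardware).

    Overflow overwrites the oldest entry; underflow yields None.
    """

    def __init__(self, size=16):
        self.size = size
        self.slots = [None] * size
        self.top = 0    # index of the next free slot (wraps around)
        self.count = 0  # number of valid entries, capped at size

    def push(self, return_addr):
        self.slots[self.top] = return_addr
        self.top = (self.top + 1) % self.size
        self.count = min(self.count + 1, self.size)

    def pop(self):
        if self.count == 0:
            return None  # underflow: decline to predict
        self.top = (self.top - 1) % self.size
        self.count -= 1
        return self.slots[self.top]


def simulate_nested_calls(depth, size=16):
    """Call `depth` functions deep, then return from all of them.

    Returns (correct, unpredicted): how many returns the toy stack
    predicted correctly and how many it could not predict at all.
    """
    rsb = CircularReturnStack(size)
    real_stack = []
    for level in range(depth):
        addr = 0x1000 + 4 * level  # made-up return addresses
        real_stack.append(addr)
        rsb.push(addr)
    correct = unpredicted = 0
    while real_stack:
        actual = real_stack.pop()
        predicted = rsb.pop()
        if predicted is None:
            unpredicted += 1
        elif predicted == actual:
            correct += 1
    return correct, unpredicted
```

In this model `simulate_nested_calls(50)` returns `(16, 34)`: the 16 innermost returns are predicted correctly, and the 34 outer ones get no prediction from the stack. Recursion behaves the same way, since only the nesting depth matters here. `simulate_nested_calls(10)` returns `(10, 0)`.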

A hardware interrupt that leads to a context switch results in the pipeline being cleared when the final uop of a macro-op executes. The RSB is likely restored to an architectural state so that execution can continue after the interrupt. It is likely possible for microcode to flush the predictor RAS / BAC RSB, and if it becomes corrupted, it eventually recovers on its own.

Screening answered 6/2, 2021 at 12:1 Comment(2)
@PeterCordes Yes, but the length of the predicted call instruction isn't known at the stage when the BPU makes an early prediction. It would have to be recorded in the BTB entry for the call by the BTB update at decode or retirement. - Screening
Oh I see. Are we sure the front-end (via the RSB) can predict a ret before the corresponding call has even been fully decoded? Assuming yes, is there any evidence that this can mispredict (or need a re-steer or something) if the call had prefixes or was otherwise hard to know the right length? - Bavaria
