Why not just predict both branches?
CPUs use branch prediction to speed up code, but the speedup only materializes when the branch is predicted correctly.

Why not simply take both branches? That is, assume both branches will be hit, cache both sides, and then take the proper one when necessary. The cache never needs to be invalidated. While this requires the compiler to lay out both branches beforehand (more memory, proper layout, etc.), I imagine that proper optimization could streamline things so that one could get near-optimal results from a single predictor. That is, one would require more memory for loading both branches (which grows exponentially with the number of unresolved branches: 2^k paths for k nested branches), but the majority of the time one should be able to "recache" the failed branch with new code quickly enough, before the taken branch has finished executing.

if (x) Bl else Br;

Instead of assuming Bl is taken, assume that both Bl and Br are taken (some type of parallel processing or special interleaving), and after the branch is actually determined, one branch becomes invalid and its cache could then be freed for reuse (maybe some special technique would be required to fill and use it properly).
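In software terms, the closest analogue of this is predication: compute both sides unconditionally and select the result afterwards. A minimal C sketch of that idea (bl and br are hypothetical side-effect-free stand-ins for Bl and Br):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical side-effect-free stand-ins for Bl and Br. */
    static inline int64_t bl(int64_t v) { return v * 3; }
    static inline int64_t br(int64_t v) { return v + 7; }

    /* Branchy version: the CPU has to predict which side runs. */
    int64_t one_path(int x, int64_t v) {
        if (x) return bl(v);
        return br(v);
    }

    /* "Both paths" version: evaluate both sides, then select. Compilers
       often lower the final ternary to a conditional move (e.g. cmov on
       x86), so no prediction is involved -- at the cost of always doing
       both computations. */
    int64_t both_paths(int x, int64_t v) {
        int64_t l = bl(v);
        int64_t r = br(v);
        return x ? l : r;
    }

    int main(void) {
        printf("%lld %lld\n", (long long)one_path(1, 10),
                              (long long)both_paths(0, 10));
        return 0;
    }

This only pays off when both sides are cheap and free of side effects; the hardware question is whether the front end and execution units can do the equivalent transparently.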

In fact, no prediction circuitry is required, and all the silicon devoted to it could instead be used to handle both branches.

Any ideas if this is feasible?

Lucia answered 3/4, 2018 at 3:56 Comment(9)
I have a feeling the (probably significant) extra resources allocated to this parallel speculative execution would be better utilized elsewhere. Try implementing some common branch predictors and compare them to your method (with reasonable constraints, otherwise it's essentially cheating). I'm sure this idea has been explored before. Wikipedia has a paragraph on "eager execution" as a form of speculative execution; you may want to look at the source for that section.Cardholder
Possible duplicate of #26455448, although the nice answer below is definitely a keeper. Anyway, the problem is that this explodes exponentially. Also read about predication, which is essentially that.Mcneill
Instruction cache never needs to be invalidated. But I think when you say "cache", you actually mean "execute". But your main paragraph barely makes any sense, even if you replace "cache" with "re-order buffer" or other internal structures that CPUs use to track in-flight instructions for speculative + out-of-order execution.Chromogenic
@PeterCordes It makes sense to others... maybe the problem is you don't know what you are talking about enough for it to make sense? Cache has many meanings... maybe you should learn a few others? If you knew anything about branch prediction you would know that when the prediction fails, any number of caches may be invalidated: users.elis.ugent.be/~leeckhou/papers/ispass06-eyerman.pdf The problem with some people is that they think they know more than they actually do... Maybe you should not try so much to fit what I say into your own memory model that is inferior?Lucia
That's an interesting paper, thanks for the link. But it's talking about branch misses being similar in cost to I-cache misses; i.e. stalling / flushing the pipeline so many front-end cycles are wasted (either unused because stalled, or wasted on the wrong path). They don't talk about branch misses causing any L1I invalidation at all. It's always possible you're just using terminology I'm not familiar with, but I think it's more likely that your question is mostly bogus and Hadi just answered re: the general principle of fetching (and maybe executing) down both sides.Chromogenic
If you see anything wrong with my description of out-of-order speculative execution on modern Intel CPUs (and how / why that makes Meltdown possible), let me know. Detecting a branch mispredict only requires discarding any in-flight instructions from the wrong path, and starting to fetch/decode from the right path. Modern CPUs have extra HW for fast recovery of branch misses (vs. exceptions) so they don't even have to roll all the way back to the retirement state. No cache invd.Chromogenic
Peter: The question is bogus and makes no sense? Even if I were wrong about terminology or exactly what misprediction did, the question was about taking both branches vs predicting one and hence can't be wrong. So for you to come in and say that it makes no sense and is bogus is actually bogus. The problem is that you seem to have some need to prove your intelligence by acting as an authority/superiority rather than being helpful. Again, to say that the question makes no sense proves that. If it made no sense then the 3 other people wouldn't have been able to understand it to make sense of it.Lucia
Obviously you're asking something about predicting both sides of a branch, but it's not clear what. Hadi just answered generically for any possible meaning your question might have. It would be a much better question if you used correct terminology to ask something specific, like about (pre) fetching into L1d cache or speculative execution. There's a big difference.Chromogenic
Despite being familiar with this stuff, I was also confused starting at the third sentence, where it appears the OP thinks that branch prediction is primarily about instruction caching, and never mentions what it is really about: fetch, decode, execute. Caching is only a small part of that, and indeed not the problematic part: if some branch is being frequently mispredicted, both sides will be quickly cached since by definition both sides are being taken frequently. You got a good answer because the question in the title is clear - but the rest just takes away from it.Rescission
A Historical Perspective on Fetching Instructions from both Paths

The first similar proposal (to my knowledge) was discussed in this 1968 patent. I understand that you are only asking about fetching instructions from both branches, but bear with me a little. In that patent, three broad strategies were laid out, one of which is following both paths (the fall-through path and the branch path). That is, not just fetching instructions from both paths, but also executing both paths. When the conditional branch instruction is resolved, one of the paths is discarded. It was only mentioned as an idea in the introduction of the patent, but the patent itself was about another invention.

Later, in 1977, IBM released a commercial processor called the IBM 3033. That is the first processor (to my knowledge) to implement exactly what you are proposing. I'm surprised that the Wikipedia page does not mention that the processor fetched instructions from both paths. The paper that describes the IBM 3033 is titled "The IBM 3033: An inside look". Unfortunately, I'm not able to find the paper. But the paper on the IBM 3090 does mention that fact. So what you're proposing did make sense and was implemented in real processors about half a decade ago.

A patent filed in 1981 and granted in 1984 describes a processor with two memories from which instructions can be fetched simultaneously. I quote from the abstract of the patent:

A dual fetch microsequencer having two single-ported microprogram memories wherein both the sequential and jump address microinstructions of a binary conditional branch can be simultaneously prefetched, one from each memory. The microprogram is assembled so that the sequential and jump addresses of each branch have opposite odd/even polarities. Accordingly, with all odd addresses in one memory and even in the other, the first instruction of both possible paths can always be prefetched simultaneously. When a conditional branch microinstruction is loaded into the execution register, its jump address or a value corresponding to it is transferred to the address register for the appropriate microprogram memory. The address of the microinstruction in the execution register is incremented and transferred to the address register of the other microprogram memory. Prefetch delays are thereby reduced. Also, when a valid conditional jump address is not provided, that microprogram memory may be transparently overlayed during that microcycle.
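
To make the odd/even trick concrete, here is a toy C model of the dual-memory prefetch the abstract describes (the bank size and all names are invented for illustration; the patent concerns microcode memories, not a C program):

    #include <stdio.h>

    /* Toy model of the dual-fetch scheme: even microinstruction addresses
       live in bank 0, odd addresses in bank 1. The microprogram is laid
       out so that a branch's sequential address (addr + 1) and its jump
       target have opposite parity, so both can be read in the same cycle,
       one from each single-ported memory. */

    #define BANK_SIZE 16

    static int bank0[BANK_SIZE]; /* even addresses 0, 2, 4, ... */
    static int bank1[BANK_SIZE]; /* odd addresses 1, 3, 5, ...  */

    static int read_bank(int addr) {
        return (addr % 2 == 0) ? bank0[addr / 2] : bank1[addr / 2];
    }

    /* Prefetch both successors of a conditional branch in one
       "microcycle". This works only because seq_addr and jump_target
       have opposite parity and thus never contend for the same port. */
    static void dual_prefetch(int branch_addr, int jump_target,
                              int *seq_out, int *jump_out) {
        int seq_addr = branch_addr + 1; /* fall-through path */
        *seq_out  = read_bank(seq_addr);
        *jump_out = read_bank(jump_target);
    }

    int main(void) {
        for (int i = 0; i < BANK_SIZE; i++) {
            bank0[i] = 2 * i;     /* pretend each "microinstruction" */
            bank1[i] = 2 * i + 1; /* is just its own address         */
        }
        int seq, jmp;
        dual_prefetch(4, 10, &seq, &jmp); /* fall-through 5 (odd), target 10 (even) */
        printf("fetched %d and %d simultaneously\n", seq, jmp);
        return 0;
    }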

A Historical Perspective on Fetching and Executing Instructions from both Paths

A lot of research was published in the 80s and 90s proposing and evaluating techniques by which instructions from both paths are not only fetched but also executed, even across multiple conditional branches. This carries the potential additional overhead of fetching the data required by both paths. The idea of branch prediction confidence was proposed in this paper in 1996 and was used to improve such techniques by being more selective about which paths to fetch and execute. Another paper (Threaded Multiple Path Execution), published in 1998, proposes an architecture that exploits simultaneous multithreading (SMT) to run multiple paths following conditional branches. Another paper (Dual Path Instruction Processing), published in 2002, proposes to fetch, decode, and rename, but not execute, instructions from both paths.

Discussion

Fetching instructions from both paths into one or more of the caches reduces the effective capacity of the caches in general, because, typically, one of the paths will be executed much more frequently than the other (in some, potentially highly irregular, pattern). Imagine fetching into the L3 cache, which is practically always shared between all the cores and holds both instructions and data. This can have a negative impact on the ability of the L3 cache to hold useful data. Fetching into the much smaller L2 cache can lead to even worse performance, especially when the L3 is inclusive. Fetching instructions from both paths across multiple conditional branches for all the cores may cause hot data held in the caches to be frequently evicted and brought back. Therefore, extreme variants of the technique you are proposing would reduce the overall performance of modern architectures. However, less aggressive variants can be beneficial.

I'm not aware of any real modern processors that fetch instructions on both paths when they see a conditional branch (perhaps some do, but it's not publicly disclosed). But instruction prefetching has been extensively researched and still is. An important question here that needs to be addressed is: what is the probability that a sufficient number of instructions from the other path are already present in the cache when the predicted path turns out to be the wrong path? If the probability is high, then there would be little motivation to fetch instructions from both paths. Otherwise, there is indeed an opportunity. According to an old paper from Intel (Wrong-Path Instruction Prefetching), on the benchmarks tested, over 50% of instructions accessed on mispredicted paths were later accessed during correct path execution. The answer to this question certainly depends on the target domain of the processor being designed.
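
One way to see why that fraction can be so high: the two sides of a branch usually reconverge at a join point, so a wrong-path fetch that runs past the join is effectively prefetching the correct path. A toy C model of just that effect (every constant below is invented for illustration and not taken from the paper; the real number depends entirely on the workload):

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy model of wrong-path reuse. A mispredicted branch fetches some
       lines down the wrong side; once the fetch runs past the join point
       where the two sides reconverge, every further line is one the
       correct path will need anyway. All constants are made up. */

    #define RUNS 1000000

    int main(void) {
        srand(1);
        long fetched = 0, reused = 0;
        for (int i = 0; i < RUNS; i++) {
            int wrong_len = 4 + rand() % 12; /* lines fetched before the mispredict resolves */
            int side_len  = 1 + rand() % 8;  /* unique lines on the wrong side before the join */
            fetched += wrong_len;
            if (wrong_len > side_len)        /* fetch ran past the join point: */
                reused += wrong_len - side_len;
        }
        printf("wrong-path lines later needed on the correct path: %.1f%%\n",
               100.0 * reused / fetched);
        return 0;
    }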

Giovanna answered 3/4, 2018 at 19:8 Comment(2)
When you say "was implemented in real processors about half a decade ago", is that meant to say "half a century"?Pusey
@Pusey Indeed.Giovanna
