X86 prefetching optimizations: "computed goto" threaded code
I have a rather non-trivial problem where my computational graph has cycles and multiple "computational paths". Instead of writing a dispatcher loop where each vertex is called one by one, I had the idea of placing all pre-allocated "frame objects" on the heap (code + data).
This is somewhat analogous to threaded code (or, even better, CPS): just jumping around the heap, executing code. Each code piece is associated with its own "frame pointer" in the heap and addresses data relative to it. Frames always remain allocated. The code just produces side effects at known locations, computes (if necessary) the next goto value and jumps there.
I haven't tried it out yet (this will be a major undertaking to make it correct, and I am fully aware of all the difficulties), so I wanted to ask experts on x86 machinery: can this be faster than a dispatcher loop? I know there are several hardware optimizations for call/ret instructions.
Is there a difference between accessing data relative to the stack pointer versus any other pointer? Is there prefetching for an indirect jump (a jump to a value stored in a register)?
Is this idea even viable?

P.S. If you've read this and still can't understand what I mean (pardon my failed attempt to explain things), imagine the whole thing as a set of many pre-allocated coroutines on a heap which yield to each other. The standard x86 stack is not used in the process; everything lives on the heap.

Dank asked 20/9, 2017 at 12:01. Comments (3):
Generally, you need to understand that code and data caches are separate, so when you jump to recently written data, the code fetch is essentially uncached as far as I know. – Goodsell
I know that. The code will remain static once all frames are allocated and linked. – Dank
All data locations are also pre-allocated. So when you jump to a new location, first something like FrameObj* this = address; is executed, and every piece of data that code touches is relative to "this". This address is static for each piece of code. – Dank
Jumping directly from block to block is often a win for branch prediction, vs. returning up to one parent indirect-branch, especially on CPUs older than Intel Haswell.


With jumps from the tail of each block, each branch has a different branch-predictor history. It's probably common for a given block to usually jump to the same next block, or to have a simple pattern of a couple of target addresses. This can often be predicted well because each branch individually has a simpler pattern, and the branch history is distributed over multiple branches.

If all dispatching happens from a single indirect branch, there may be only one BTB (branch target buffer) entry for it, and the pattern will be too complicated for it to predict well.
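
To make the contrast concrete, here's a minimal sketch in GNU C (labels-as-values / computed goto are a GCC and Clang extension, and all names here are made up). Each handler ends with its own goto *, so each dispatch site gets its own predictor entry and history:

    /* Minimal GNU C sketch: replicated computed-goto dispatch.
       Hypothetical 3-opcode "bytecode"; prog must end with opcode 2. */
    long run(const unsigned char *prog)
    {
        static void *handlers[] = { &&op_inc, &&op_dec, &&op_halt };
        long acc = 0;
        goto *handlers[*prog];      /* initial dispatch */

    op_inc:
        acc++;
        prog++;
        goto *handlers[*prog];      /* dispatch branch #1: its own BTB entry */
    op_dec:
        acc--;
        prog++;
        goto *handlers[*prog];      /* dispatch branch #2: its own BTB entry */
    op_halt:
        return acc;
    }

A central dispatcher would instead have every handler jump back to a single goto *handlers[*prog]; at the top of a loop, leaving one branch to predict the whole interleaved opcode sequence.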

Modern TAGE branch predictors in Intel Haswell and later index the BTB using recent branch history, including indirect-branch destination, which does actually work around this problem. See comments on Indexed branch overhead on X86 64 bit mode, and search for Haswell in https://danluu.com/branch-prediction/ . Complicated patterns for one branch can scatter predictions for it across many of the BTB entries.

Specifically, Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore (2015) by Rohou, Swamy, and Seznec compares Nehalem, SandyBridge, and Haswell on interpreter benchmarks and measures actual mispredict rate for dispatch loops with a single switch statement. They find that Haswell does much better, likely using an ITTAGE predictor.

They don't test AMD CPUs. AMD has published some info on their CPUs since Piledriver using Perceptron neural networks for branch prediction. I don't know how well they handle dispatch loops with a single indirect branch. (AMD since Zen 2 uses IT-TAGE as a second level branch predictor, in addition to the hashed perceptron they kept from Zen 1.)


Darek Mihocka discusses this pattern in the context of an interpreting CPU emulator, which jumps from block to block of handlers for different instructions (or simplified uops). He goes into a lot of detail about the performance of various strategies on Core2, Pentium4, and AMD Phenom (it was written in 2008). Modern branch predictors on current CPUs are most like the Core2.

He eventually presents what he calls the Nostradamus Distributor pattern for checking for early-out (functions return a function-pointer, or a "fire escape" sentinel), in a branch-prediction-friendly way. If you don't need that, just see the early part of the article where he talks about direct chaining of jumps between blocks vs. a central distributor.

He even bemoans the lack of a code-prefetch instruction in x86. That was probably a bigger deal with Pentium 4, where initial decode to populate the trace cache was very slow compared to running from the trace cache. Sandybridge-family has a decoded-uop cache, but it isn't a trace cache, and the decoders are still strong enough not to suck when the uop cache misses. Ryzen is similar.

Is there a difference between accessing data relative to the stack pointer or any other pointer?

No. You could even set rsp after jumping so each block can have its own stack. If you have any signal handlers installed, rsp needs to point to valid memory. Also, if you want to be able to call any normal library functions, you need rsp to work as a stack pointer, because they will want to ret.

Is there a prefetching for an indirect jump (jump to value stored in register?).

Prefetch into L2 could be useful if you know the branch target address long before you're ready to execute an indirect jump. All current x86 CPUs use split L1I / L1D caches, so prefetcht0 would pollute L1D for no gain, but prefetcht1 might be useful (fetch into L2 and L3). Or it might not be useful at all, if the code is already hot in L2.
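
If you want to experiment with that, the portable intrinsic looks like this (just a sketch; the helper name and when you'd call it are up to your design):

    #include <immintrin.h>

    /* Prefetch the code bytes of a block we expect to jump to later.
       _MM_HINT_T1 maps to prefetcht1: fetch into L2 (and L3) but not
       L1D, so we don't pollute L1D with code bytes.  L2 and L3 are
       unified, so a later code fetch can hit there. */
    static inline void prefetch_code_block(const void *block)
    {
        _mm_prefetch((const char *)block, _MM_HINT_T1);
    }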

Also useful: calculate the jump target address as early as possible, so out-of-order execution can resolve the branch while lots of work is queued up in the out-of-order core. This minimizes the potential bubble in the pipeline. Keep the calculation independent of other stuff if possible.

Best case is having the address in a register many instructions before the jmp, so as soon as the jmp gets a cycle on an execution port, it can provide the correct destination to the front-end (and re-steer it if branch prediction was wrong). Worst case is when the branch target is the result of a long dependency chain of instructions right before the branch. A couple of independent instructions, and/or a memory-indirect jump, is fine; out-of-order execution should find cycles to run those instructions once they're in the OoO scheduler.
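
In GNU C terms the pattern looks something like this (Frame and the block contents are hypothetical; the only point is the ordering of the target load relative to the real work):

    /* Sketch: load the indirect-jump target early. */
    typedef struct Frame Frame;
    struct Frame {
        void  *code;    /* entry label of this block               */
        Frame *next;    /* successor frame, linked at setup time   */
        long   data;
    };

    long step_chain(void)
    {
        Frame b = { NULL, NULL, 5 };
        Frame a = { NULL, &b,   3 };
        a.code = &&block_a;
        b.code = &&block_b;

        Frame *f = &a;
        long total = 0;
        goto *f->code;

    block_a:
        {
            /* Load the target FIRST: the branch input is ready long
               before the jump executes, so OoO exec can resolve it
               early while the work below is still in flight. */
            void *target = f->next->code;
            total += f->data;           /* the block's actual work  */
            f = f->next;
            goto *target;
        }

    block_b:
        total += f->data;
        return total;                   /* 3 + 5 = 8                */
    }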

There are also split L1 iTLB and dTLBs, but the L2 TLB is usually unified on most microarchitectures, and IIRC it works as a victim cache for the L1 TLBs. A prefetch might trigger a page walk to populate an entry in the L1 dTLB, but on some microarchitectures that wouldn't help avoid an iTLB miss. (At least it would get the page-table data itself into L1D, or maybe into internal page-directory caches in the page-walk hardware, so another page walk for the same entry would be fast. But since CPUs other than Intel Skylake and later have only one hardware page-walk unit, if the iTLB miss happens while the first page walk is still in flight, it might not be able to start right away, so the prefetch could actually hurt if your code is so scattered that you're getting iTLB misses.)

Use 2MB hugepages for the chunk of memory you will JIT into to reduce TLB misses. Probably best to lay out the code in a fairly tight region, with the data separate. DRAM locality effects are a real thing. (A DRAM page is usually bigger than 4kiB, I think, but it's a hardware thing and you can't choose. It's lower latency to access within an already-open page.)
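
On Linux, for instance, allocating the region could look like this (a sketch: MAP_HUGETLB needs hugepages reserved by the admin, e.g. via /proc/sys/vm/nr_hugepages, and madvise(MADV_HUGEPAGE) is the softer transparent-hugepage fallback):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* Linux-only sketch: back the JIT code region with 2MiB pages.
       'size' should be a multiple of 2MiB. */
    static void *alloc_jit_region(size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            /* Fall back to 4k pages and ask for transparent hugepages. */
            p = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p != MAP_FAILED)
                madvise(p, size, MADV_HUGEPAGE);
        }
        return p;
    }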

See Agner Fog's microarch pdf, and also Intel's optimization manual (and AMD's manual too, if you're worried about AMD CPUs). See more links in the x86 tag wiki.

Is this idea even viable?

Yes, probably.

If possible, when one block always jumps to another block, elide the jump by making the blocks contiguous.

Relative addressing for data is easy: x86-64 has RIP-relative addressing.

You can lea rdi, [rel some_label] and then index from there, or just use RIP-relative addressing directly for some of your static data.

You're going to be JITting your code or something, so just calculate signed offsets from the end of the current instruction to the data to be accessed, and that's your RIP-relative offset. Position-independent code + static data is easy in x86-64.
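
For example, one emitter in your JIT might look like this (the encoding is real; the helper name and buffer layout are made up):

    #include <stdint.h>
    #include <string.h>

    /* Emit "lea rdi, [rip + rel32]": bytes 48 8D 3D <rel32>, 7 total.
       rel32 is measured from the END of this instruction, and the
       target must be within +/-2GiB of it. */
    static uint8_t *emit_lea_rdi_rip(uint8_t *code, const void *target)
    {
        static const uint8_t opc[3] = { 0x48, 0x8D, 0x3D };
        memcpy(code, opc, 3);
        int32_t rel32 = (int32_t)((const uint8_t *)target - (code + 7));
        memcpy(code + 3, &rel32, sizeof rel32);
        return code + 7;
    }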


In Granite Rapids and later, PREFETCHIT0 [rip+rel32] prefetches code into "all levels" of cache, or prefetchit1 to prefetch into all levels except L1i.

These instructions are a NOP with an addressing mode other than RIP-relative, or on CPUs that don't support them. (Perhaps they also prime the iTLB or even the uop cache, or at least could on paper.) The docs in Intel's "future extensions" manual as of December 2022 recommend that the target address be the start of some instruction.

It's only useful to prefetch if you do it early enough, and it doesn't solve the mispredict problem. It might or might not be a win for an interpreter to prefetch the code for the bytecode instruction after the current one; prefetchit0 can't do that anyway, since it only works with RIP-relative addressing, with no indirection. Perhaps that's because the code-fetch parts of the CPU, like L1i and the iTLB, don't have AGUs for arbitrary addresses, if the instruction works by feeding the address to those. So it's no help in prefetching a runtime-variable code location.

Cybernetics answered 20/9, 2017 at 13:22. Comments (4):
I've also figured out a way to test the idea without needing a JIT! Also, I know it's frowned upon, but thank you for the very detailed and elaborate answer. This helped a lot! – Dank
@artemonster: Cheers. It's not too frowned upon if you also have something else to say. For testing, you'd probably just worry about the branching part, right? Having static data near each code block is a separate thing (and of questionable utility). Anyway, yeah, easy in asm. From C you could arrange the code so it can tail-call the next function. – Cybernetics
I'll make a really big function with each code block enclosed in braces and labeled. At the beginning of the function I'll initialize all contexts with the proper label values, e.g. someFrameObject150.next = &&somelabel;, and just do a goto *(startFrame.next) :) I'll also have the same code split into functions that return a function pointer to the next block, to be called from a dispatcher loop. I can't wait to try it out :) – Dank
@artemonster: make sure you read Darek Mihocka's article so you know what mistakes to avoid. And also so you can look at the asm to check that the compiler didn't "optimize" things into one common distributor indirect-branch, which defeats the whole purpose. – Cybernetics
