How to write an instruction-cache-friendly program in C++?
Recently Herb Sutter gave a great talk on "Modern C++: What You Need to Know". The main theme of the talk was efficiency, and how data locality and memory-access patterns matter. He also explained how linear access of memory (array/vector) is loved by the CPU, and took an example from another classic reference on this topic, "Game performance" by Bob Nystrom.

After reading these articles, I understood that there are two types of cache which impact program performance:

  1. Data Cache
  2. Instruction Cache

The Cachegrind tool also measures instrumentation information for both cache types in our program. The first point has been explained by many articles/blogs, including how to achieve good data-cache efficiency (data locality).

However, I did not get much information on the instruction cache, or on what we should take care of in our programs to achieve better performance in this respect. As per my understanding, we (programmers) do not have much control over which instructions execute, or in what order.

It would be really nice if small C++ programs could show how this counter (i.e. the instruction cache) varies with our style of writing programs. What best practices should a programmer follow to achieve better performance with respect to this point?

I mean, we can demonstrate data-cache topics with small programs (vector vs. list); is it possible to explain the second point in a similar way? The main intention of this question is to understand this topic as much as possible.

Gorky answered 7/4, 2014 at 19:29 Comment(3)
Avoid virtual functions, and break complex code into smaller loops.Hawley
@Leeor: in what way is that duplicate?Hawley
Sorry, I referred to the wrong question, I'll retract the vote once I'm next to a computer, but I'm positive it was answered before with code caches context, including trace caches, decoded uop caches, etc..Closeknit

Any code that changes the flow of execution affects the instruction cache. This includes function calls and loops, as well as calls through function pointers.

When a branch or jump instruction is executed, the processor must determine whether the target code is already in the instruction cache or whether the instruction cache needs to be reloaded (from the destination of the branch).

For example, some processors have an instruction cache large enough to hold the executable code of small loops. Some processors don't have a large instruction cache and simply reload it. Reloading the instruction cache takes time that could otherwise be spent executing instructions.

Search these topics:

  • Loop unrolling
  • Conditional instruction execution (available on ARM processors)
  • Inline functions
  • Instruction pipeline

Edit 1: Programming techniques for better performance
To improve performance and reduce the instruction cache reloading do the following:

Reduce "if" statements
Design your code to minimize "if" statements. This may include Boolean algebra, using more math, or simplifying comparisons (are they really needed?). Prefer to keep the contents of the "then" and "else" clauses small so that the compiler can use conditional assembly-language instructions.

Define small functions as inline or macros
There is overhead associated with calling a function, such as storing the return location and reloading the instruction cache. For functions with a small number of statements, try suggesting to the compiler that they be made inline. Inlining means pasting the contents of the code at the call site rather than making a function call. Since the function call is avoided, so is the need to reload the instruction cache.

Unroll loops
For small iteration counts, don't loop; repeat the content of the loop instead (some compilers do this at higher optimization levels). The more content repeated, the fewer branches to the top of the loop and the less need to reload the instruction cache.

Use table lookups, not "if" statements
Some programs use "if-else-if" ladders to map data to values. Each "if" statement is a potential break in the instruction stream. Sometimes, with a little math, the values can be placed in a table, such as an array, and the index computed mathematically. Once the index is known, the processor can retrieve the data without disrupting the instruction cache.

Change data or data structures
If the type of data is constant, a program can be optimized around the data. For example, a program handling message packets could base its operations on the packet IDs (think array of function pointers), with each function optimized for processing its kind of packet.

Change linked lists to arrays or other random-access containers. Array elements can be accessed by computing an address, without interrupting execution; a linked list must be traversed (a loop) to find an item.

Adamson answered 7/4, 2014 at 19:39 Comment(10)
"Unroll loops" -> I believe that one is completely wrong. The full loop will most likely be kept in cache, and branching to the top should not cause any problem. Keeping loops is one of the main reasons why code is stored in the cache and not just prefetched. Actually it's probably the opposite, it is recommended not unrolling loops to avoid generating tons of repeated instructions that will make your code larger and less likely to fit in cache.Alyse
Whether or not a loop fits in the instruction cache is dependent on the processor. There are many different processors out there. Also, loop unrolling may be recognized by the compiler and the compiler can emit instructions for parallel processing. The primary purpose of loop unrolling is to avoid branch instructions, that in worst case, cause a reloading of the instruction cache. So, even if the loop doesn't fit into the instruction cache, you can eliminate the cost of branching to the top of the loop by unrolling.Adamson
Also, there is the time versus space trade-off. You'll find that fast algorithms tend to take up more space than optimizations for space. Saving space often slows down execution. So, making the code larger is a sacrifice for making the code faster.Adamson
While it makes sense with your explanations, this is the first time I've heard about using loop unrolling to improve prefetch. Most processors seem to have 32 kB instruction caches nowadays, that looks large enough to me for handling most loops, and even if the loop does not fit in cache wouldn't the loop branch be predicted and prefetched by the processor anyway ?Alyse
I'm not talking explicitly about prefetch, but actual parallel execution of instructions. Some processors have the capability to process instructions in parallel. Under advanced optimization, compilers can emit instructions to these processors to execute instructions in parallel.Adamson
Branch prediction is a nice feature, but it takes {execution} time. Removing branches reduces the need for branch prediction and speeds up the program. Loop unrolling may not be effective when there are function calls within the loop, provided the functions are not inlined.Adamson
@ThomasMatthews you mentioned "a program handling message packets ... Functions would be optimized for packet processing", could you elaborate what "functions would be optimized for packet processing" means? Do you mean that calling functions via function pointers in an array is faster (or more instruction cache friendly?) than using a switch block to dispatch to those functions? Thanks!Syringe
@HCSF: First, have an array of pointers to functions, based on the packet ID. Each function would be tailored to the packet ID. For example, a function to process a message for reading hard drive data would be optimized for reading the data . A function to process a message for reading hard drive ID would be optimized for reading the hard drive data. These functions could be optimized for their specializations rather than have one function to handle all messages. Let's just say that a small specialized function can be optimized better than one huge generic function.Adamson
@HCSF: Since there is no universally standardized packet, I can't go into details about the packet processing function. All the details depend on the data or message specification. For example, ATAPI messages have different layouts than JSON packets or TCP packets. In the embedded systems world, there could be custom packets from devices to the host; or from a host to a database.Adamson
@ThomasMatthews ah, so your point is that breaking a huge generic function into small ones based on the specific message IDs (e.g. struct msg { uint8_t msgType;}) could be "optimized" (by hands? by compiler?). And you weren't saying using funcPtrs[msg.msgType] is more I-cache friendly/faster than switch (msg.msgType) {...} right?Syringe

© 2022 - 2024 — McMap. All rights reserved.