How to write an instruction-cache-friendly program in C++?
Recently Herb Sutter gave a great talk on "Modern C++: What You Need to Know". The main theme of the talk was efficiency, and how data locality and memory-access patterns matter. He also explained how linear access of memory (array/vector) is loved by the CPU, and took an example from another classic reference on this topic, "Game performance" by Bob Nystrom.

After reading these articles, I understood that there are two types of cache which impact program performance:

  1. Data Cache
  2. Instruction Cache

The Cachegrind tool also measures instrumentation information for both cache types in our program. The first point has been explained by many articles/blogs, including how to achieve good data-cache efficiency (data locality).

However, I did not get much information on the instruction cache, or on what we should take care of in our programs to achieve better performance in this respect. As per my understanding, we (programmers) do not have much control over which instructions execute, or in what order.

It would be really nice if small C++ programs could show how this counter (i.e. the instruction cache) varies with our style of writing programs. What best practices should a programmer follow to achieve better performance with respect to this point?

I mean, we can demonstrate data-cache topics with small programs (vector vs. list); is it possible to explain the second point in a similar way? The main intention of this question is to understand this topic as much as possible.

Gorky answered 7/4, 2014 at 19:29 Comment(3)
Avoid virtual functions, and break complex code into smaller loops.Hawley
@Leeor: in what way is that duplicate?Hawley
Sorry, I referred to the wrong question, I'll retract the vote once I'm next to a computer, but I'm positive it was answered before with code caches context, including trace caches, decoded uop caches, etc..Closeknit

Any code that changes the flow of execution affects the instruction cache. This includes function calls and loops, as well as calls through function pointers.

When a branch or jump instruction is executed, the processor must determine whether the target code is already in the instruction cache or whether the instruction cache needs to be reloaded (from the destination of the branch).

For example, some processors have an instruction cache large enough to hold the executable code of small loops. Some processors don't have a large instruction cache and simply reload it. Reloading the instruction cache takes time that could otherwise be spent executing instructions.

Search these topics:

  • Loop unrolling
  • Conditional instruction execution (available on ARM processors)
  • Inline functions
  • Instruction pipeline

Edit 1: Programming techniques for better performance
To improve performance and reduce the instruction cache reloading do the following:

Reduce "if" statements
Design your code to minimize "if" statements. This may include Boolean algebra, using more math, or simplifying comparisons (are they really needed?). Prefer to keep the contents of the "then" and "else" clauses small so that the compiler can use conditional assembly-language instructions.

Define small functions as inline or macros
There is overhead associated with calling a function, such as storing the return location and reloading the instruction cache. For functions with a small number of statements, try suggesting to the compiler that they be made inline. Inlining means pasting the contents of the code at the call site rather than making a function call. Since the function call is avoided, so is the need to reload the instruction cache.

Unroll loops
For small iteration counts, don't loop; repeat the content of the loop instead (some compilers do this at higher optimization levels). The more content repeated, the fewer branches to the top of the loop and the less need to reload the instruction cache.

Use table lookups, not "if" statements
Some programs use "if-else-if" ladders to map data to values. Each "if" statement is a potential break in the instruction stream. Sometimes, with a little math, the values can be placed in a table, such as an array, and the index computed mathematically. Once the index is known, the processor can retrieve the data without disrupting the instruction cache.

Change data or data structures
If the type of data is constant, a program can be optimized around the data. For example, a program handling message packets could base its operations on the packet IDs (think array of function pointers), with each function optimized for processing its kind of packet.

Change linked lists to arrays or other random-access containers. Array elements can be accessed by computing an address, without interrupting execution; a linked list must be traversed (a loop) to find an item.

Adamson answered 7/4, 2014 at 19:39 Comment(10)
"Unroll loops" -> I believe that one is completely wrong. The full loop will most likely be kept in cache, and branching to the top should not cause any problem. Keeping loops is one of the main reasons why code is stored in the cache and not just prefetched. Actually it's probably the opposite, it is recommended not unrolling loops to avoid generating tons of repeated instructions that will make your code larger and less likely to fit in cache.Alyse
Whether or not a loop fits in the instruction cache is dependent on the processor. There are many different processors out there. Also, loop unrolling may be recognized by the compiler and the compiler can emit instructions for parallel processing. The primary purpose of loop unrolling is to avoid branch instructions, that in worst case, cause a reloading of the instruction cache. So, even if the loop doesn't fit into the instruction cache, you can eliminate the cost of branching to the top of the loop by unrolling.Adamson
Also, there is the time versus space trade-off. You'll find that fast algorithms tend to take up more space than optimizations for space. Saving space often slows down execution. So, making the code larger is a sacrifice for making the code faster.Adamson
While it makes sense with your explanations, this is the first time I've heard about using loop unrolling to improve prefetch. Most processors seem to have 32 kB instruction caches nowadays, that looks large enough to me for handling most loops, and even if the loop does not fit in cache wouldn't the loop branch be predicted and prefetched by the processor anyway ?Alyse
I'm not talking explicitly about prefetch, but actual parallel execution of instructions. Some processors have the capability to process instructions in parallel. Under advanced optimization, compilers can emit instructions to these processors to execute instructions in parallel.Adamson
Branch prediction is a nice feature, but it takes {execution} time. Removing branches reduces the need for branch prediction and speeds up the program. Loop unrolling may not be effective when there are function calls within the loop, provided the functions are not inlined.Adamson
@ThomasMatthews you mentioned "a program handling message packets ... Functions would be optimized for packet processing", could you elaborate what "functions would be optimized for packet processing" means? Do you mean that calling functions via function pointers in an array is faster (or more instruction cache friendly?) than using a switch block to dispatch to those functions? Thanks!Syringe
@HCSF: First, have an array of pointers to functions, based on the packet ID. Each function would be tailored to the packet ID. For example, a function to process a message for reading hard drive data would be optimized for reading the data . A function to process a message for reading hard drive ID would be optimized for reading the hard drive data. These functions could be optimized for their specializations rather than have one function to handle all messages. Let's just say that a small specialized function can be optimized better than one huge generic function.Adamson
@HCSF: Since there is no universally standardized packet, I can't go into details about the packet processing function. All the details depend on the data or message specification. For example, ATAPI messages have different layouts than JSON packets or TCP packets. In the embedded systems world, there could be custom packets from devices to the host; or from a host to a database.Adamson
@ThomasMatthews ah, so your point is that breaking a huge generic function into small ones based on the specific message IDs (e.g. struct msg { uint8_t msgType;}) could be "optimized" (by hands? by compiler?). And you weren't saying using funcPtrs[msg.msgType] is more I-cache friendly/faster than switch (msg.msgType) {...} right?Syringe

© 2022 - 2024 — McMap. All rights reserved.