How to mark some memory ranges as non-cacheable from C++?
I was reading the wikipedia on the CPU cache here: http://en.wikipedia.org/wiki/CPU_cache#Replacement_Policies

Marking some memory ranges as non-cacheable can improve performance, by avoiding caching of memory regions that are rarely re-accessed. This avoids the overhead of loading something into the cache, without having any reuse.

Now, I've been reading and learning about how to write programs with better cache performance (general considerations, usually not specific to C++), but I did not know that high-level code could interact with CPU caching behavior explicitly. So my question is: is there a way to do what that article describes, in C++?

Also, I would appreciate resources on how to improve cache performance specifically in C++, even if they do not use any functions that deal directly with the CPU caches. For example, I'm wondering whether excessive levels of indirection (e.g., a container of pointers to containers of pointers) can hurt cache performance.

Marlborough answered 3/3, 2012 at 6:35 Comment(2)
My only experience is the trivial one, with nested loops: if the order of the loops is interchangeable, then paying attention to the cache usually impacts performance in a measurable way. But as for the pointers, I am also very curious.Hanfurd
@Hanfurd I was not aware of this pointers issue either until I saw this thread on SO: #2487340 The OP in that thread says his smart pointer is messing with his cache, but none of the responses to that thread seem to give a definitive solution.Marlborough
On Windows, you can use VirtualProtect(ptr, length, PAGE_READWRITE | PAGE_NOCACHE, &oldFlags) to mark a range of pages as non-cacheable. (PAGE_NOCACHE is a modifier, so it must be combined with an access-protection flag such as PAGE_READWRITE.)

Regarding too many indirections: yes, they can damage cache performance, if they make you touch many different pieces of memory often (which is usually what happens). Note, though, that if you consistently dereference the same set of, say, 8 blocks of memory and only the 9th block varies, it generally won't make a difference, because the 8 blocks will be cached after the first access.

Jasso answered 3/3, 2012 at 6:40 Comment(4)
Thanks for the link, but I do not plan on doing any Windows programming. As for your explanation about the indirections: How is that different than normal caching behavior? What I'm getting from what you're saying is that if you dereference something consistently, cache performance won't be damaged. What is the difference when that dereferencing involves one pointer vs four pointers?Marlborough
@newprogrammer: I don't know the equivalent (if one exists in the first place) for Linux/Mac/Unix/etc., sorry. :( Regarding the indirections: it is "normal" behavior, I was just explaining it. :) If you dereference the same things over and over again, it will be in the CPU cache (hopefully L1), and so it will have a 1-cycle penalty -- which is the same as a register load anyway. But if you dereference different pointers a lot, then they would kick each other out of the cache, severely bringing down performance.Jasso
Thanks for the explanation. Anyways, I plan on writing portable code, which is why there will be no Windows programming, not that I will be using another OS.Marlborough
@newprogrammer: Sure. :) I don't think there's any portable way to do this, unfortunately... but "portable" code is, after all, simply code that uses a single interface whose functionality is implemented for all the platforms separately... so you might want to just make your own abstraction around it, implement it for different platforms, and use that abstraction instead. Then everything except your abstraction becomes portable!Jasso
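Following up on the abstraction idea from the comments: a minimal sketch of such a wrapper, assuming a Windows implementation via VirtualProtect and a "not supported" fallback everywhere else. The name set_uncacheable is illustrative, not a real API.

```cpp
#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#endif

// Illustrative wrapper (hypothetical name): try to mark [ptr, ptr+len)
// as non-cacheable where the platform supports it; report failure
// where it does not.
bool set_uncacheable(void* ptr, std::size_t len) {
#ifdef _WIN32
    // PAGE_NOCACHE is a modifier and must be combined with an
    // access-protection flag such as PAGE_READWRITE.
    DWORD oldFlags = 0;
    return VirtualProtect(ptr, len, PAGE_READWRITE | PAGE_NOCACHE,
                          &oldFlags) != 0;
#else
    (void)ptr;
    (void)len;
    return false;  // no portable equivalent on this platform
#endif
}
```

Callers can then branch on the return value, and only this one function needs porting when another platform grows an equivalent facility.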
Some platforms have support for non-temporal loads and stores that bypass the caches, which avoids evicting whatever was previously cached. These instructions are generally not exposed by higher-level languages directly; you have to use compiler intrinsics or write your own assembly. And since even the existence of a cache is platform-specific, the means of controlling it are likewise platform-specific. On x86, streaming (non-temporal) stores have existed since SSE/SSE2, and SSE4.1 added a non-temporal load (MOVNTDQA).
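With GCC, Clang, or MSVC you can reach these instructions through intrinsics rather than hand-written assembly. A minimal sketch using the SSE streaming store _mm_stream_ps (the destination must be 16-byte aligned; the helper name stream_four is illustrative):

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_stream_ps, _mm_sfence

// Write four floats with a non-temporal (streaming) store, which
// bypasses the cache hierarchy instead of evicting other data.
void stream_four(float* dst16 /* must be 16-byte aligned */) {
    // _mm_set_ps takes arguments high-to-low, so memory order is {1,2,3,4}.
    __m128 v = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    _mm_stream_ps(dst16, v);  // MOVNTPS: store without caching
    _mm_sfence();             // order the streaming store before later reads
}
```

The _mm_sfence barrier matters: streaming stores are weakly ordered, so without it a subsequent read on another core might not observe the data.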

For programmers who mostly target x86 platforms other than Windows, this article on x86 and x86-64 GCC intrinsics is probably the most useful resource.

Crystacrystal answered 3/3, 2012 at 8:9 Comment(1)
The first link is paywalled. Please cite the relevant parts of the documentation you link as per policy. Link-only answers are frowned upon on this site.Cleavland
A rule of thumb for optimizing cache accesses is: cluster together loads and stores to the same address. For example, the following code:

for (size_t i=0; i<A.size(); ++i)
  B[i] = func1(A[i]);

for (size_t i=0; i<A.size(); ++i)
  C[i] = func2(A[i]);

can be fused so that it uses the cache more efficiently:

for (size_t i=0; i<A.size(); ++i) {
  B[i] = func1(A[i]); // A[i] is fetched to the cache
  C[i] = func2(A[i]); // Good chance that A[i] remains in the cache
}

Modern CPUs are quite good at recognizing regular memory-access patterns in loops, and can prefetch data into the cache ahead of use, speeding up execution. So another rule of thumb would be: prefer contiguous containers such as std::vector and std::array over node-based containers.
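This also bears on the indirection question from the original post. A small sketch contrasting the two layouts (the results are identical; only the memory-access pattern differs):

```cpp
#include <memory>
#include <numeric>
#include <vector>

// Contiguous layout: neighbouring elements share cache lines, and the
// hardware prefetcher can stream them in ahead of the loop.
long sum_flat(const std::vector<int>& flat) {
    return std::accumulate(flat.begin(), flat.end(), 0L);
}

// One extra level of indirection: every element is a separate heap
// allocation, so consecutive iterations may touch scattered cache lines
// that the prefetcher cannot predict.
long sum_boxed(const std::vector<std::unique_ptr<int>>& boxed) {
    long sum = 0;
    for (const auto& p : boxed) sum += *p;
    return sum;
}
```

Both functions compute the same sum, but on large inputs the contiguous version typically benefits far more from prefetching.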

Cup answered 3/3, 2012 at 7:58 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.