How much of ‘What Every Programmer Should Know About Memory’ is still valid?

I am wondering how much of Ulrich Drepper's What Every Programmer Should Know About Memory from 2007 is still valid. Also I could not find a newer version than 1.0 or an errata.

(Also in PDF form on Ulrich Drepper's own site: https://www.akkadia.org/drepper/cpumemory.pdf)

Delectation answered 14/11, 2011 at 18:30 Comment(2)
does anyone know if I can download this article in mobi format somewhere so I can easily read it on my Kindle? The PDF is very difficult to read because of problems with zoom/formattingPyrrhotite
It isn't mobi, but LWN ran the paper as a set of articles that are easier to read on a phone/tablet. The first is at lwn.net/Articles/250967Thickset

As far as I remember, Drepper's content describes fundamental concepts about memory: how the CPU cache works, what physical and virtual memory are, and how the Linux kernel deals with that zoo. There are probably outdated API references in some examples, but it doesn't matter; that won't affect the relevance of the fundamental concepts.

So, any book or article that describes something fundamental cannot be called outdated. "What Every Programmer Should Know About Memory" is definitely worth reading, but, well, I don't think it's for "every programmer". It's more suitable for system/embedded/kernel people.

Joshia answered 14/11, 2011 at 18:40 Comment(2)
Yeah, I really don't see why a programmer should need to know how SRAM and DRAM work at the analogue level - that won't help much when writing programs. And people who really need that knowledge would be better off spending the time reading the manuals about the actual timings, etc. But for people interested in the low-level HW stuff? Maybe not useful, but at least entertaining.Cornucopia
Nowadays performance == memory performance, so understanding memory is the most important thing in any high performance application. This makes the paper essential for anyone involved in: game development, scientific computing, finance, databases, compilers, processing of large datasets, visualization, anything that has to handle lots of requests... So yes, if you are working in an application that is idle most of the time, like a text editor, the paper is completely uninteresting until you need to do something fast like find a word, count the words, spellcheck... oh wait... nevermind.Walliw

The guide in PDF form is at https://www.akkadia.org/drepper/cpumemory.pdf.

It is still generally excellent and highly recommended (by me, and I think by other performance-tuning experts). It would be cool if Ulrich (or anyone else) wrote a 2017 update, but that would be a lot of work (e.g. re-running the benchmarks). See also other x86 performance-tuning and SSE/asm (and C/C++) optimization links in the tag wiki. (Ulrich's article isn't x86 specific, but most (all) of his benchmarks are on x86 hardware.)

The low level hardware details about how DRAM and caches work all still apply. DDR4 uses the same commands as described for DDR1/DDR2 (read/write burst). The DDR3/4 improvements aren't fundamental changes. AFAIK, all the arch-independent stuff still applies generally, e.g. to AArch64 / ARM32.

See also the Latency Bound Platforms section of this answer for important details about the effect of memory/L3 latency on single-threaded bandwidth: bandwidth <= max_concurrency / latency, and this is actually the primary bottleneck for single-threaded bandwidth on a modern many-core CPU like a Xeon. But a quad-core Skylake desktop can come close to maxing out DRAM bandwidth with a single thread. That link has some very good info about NT stores vs. normal stores on x86. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? is a summary.
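To put rough numbers on that formula (illustrative assumptions, not measurements): with about 10 line-fill buffers per core, a single core can have roughly 10 x 64 B = 640 B of demand-miss data in flight at once, so at ~90 ns load latency to DRAM on a big Xeon that caps one thread at around 640 B / 90 ns ≈ 7 GB/s, well below what the socket's memory controllers can deliver to all cores together.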

Thus Ulrich's suggestion in 6.5.8 Utilizing All Bandwidth about using remote memory on other NUMA nodes as well as your own is counter-productive on modern hardware, where memory controllers have more bandwidth than a single core can use. You can possibly imagine a situation where there's a net benefit to running multiple memory-hungry threads on the same NUMA node for low-latency inter-thread communication, while having them use remote memory for high-bandwidth, not-latency-sensitive stuff. But this is pretty obscure; normally just divide threads between NUMA nodes and have them use local memory. Per-core bandwidth is sensitive to latency because of max-concurrency limits (see below), but all the cores in one socket can usually more than saturate the memory controllers in that socket.
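For example, a simple way to do that split at process granularity (worker_a / worker_b are hypothetical binaries) is with numactl:

numactl --cpunodebind=0 --membind=0 ./worker_a &
numactl --cpunodebind=1 --membind=1 ./worker_b &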


(usually) Don't use software prefetch

One major thing that's changed is that hardware prefetch is much better than on the Pentium 4 and can recognize strided access patterns up to a fairly large stride, and multiple streams at once (e.g. one forward / backward per 4k page). Intel's optimization manual describes some details of the HW prefetchers in various levels of cache for their Sandybridge-family microarchitecture. Ivybridge and later have next-page hardware prefetch, instead of waiting for a cache miss in the new page to trigger a fast-start. I assume AMD has some similar stuff in their optimization manual. Beware that Intel's manual is also full of old advice, some of which is only good for P4. The Sandybridge-specific sections are of course accurate for SnB, but e.g. un-lamination of micro-fused uops changed in HSW and the manual doesn't mention it.

The usual advice these days is to remove all SW prefetch from old code, and only consider putting it back in if profiling shows cache misses (and you're not saturating memory bandwidth). Prefetching both sides of the next step of a binary search can still help: e.g. once you decide which element to look at next, prefetch the 1/4 and 3/4 elements so they can load in parallel with loading/checking the middle element.
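A minimal sketch of that idea in C++ (hypothetical function; _mm_prefetch is from <xmmintrin.h>, and whether it actually wins still has to be confirmed by profiling):

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

// Binary search that prefetches both possible next probes (the 1/4 and 3/4
// points) while the current middle element is being loaded and compared.
std::size_t lower_idx(const int *a, std::size_t n, int key)
{
    std::size_t lo = 0, hi = n;                    // half-open range [lo, hi)
    while (hi - lo > 1) {
        std::size_t mid = lo + (hi - lo) / 2;
        _mm_prefetch((const char *)&a[lo + (hi - lo) / 4],     _MM_HINT_T0);
        _mm_prefetch((const char *)&a[lo + 3 * (hi - lo) / 4], _MM_HINT_T0);
        if (a[mid] <= key)
            lo = mid;
        else
            hi = mid;
    }
    return lo;   // index of the last element <= key (assuming a[0] <= key)
}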

The suggestion to use a separate prefetch thread (6.3.4) is totally obsolete, I think, and was only ever good on Pentium 4. P4 had hyperthreading (2 logical cores sharing one physical core), but not enough trace-cache (and/or out-of-order execution resources) to gain throughput running two full computation threads on the same core. But modern CPUs (Sandybridge-family and Ryzen) are much beefier and should either run a real thread or not use hyperthreading (leave the other logical core idle so the solo thread has the full resources instead of partitioning the ROB).

Software prefetch has always been "brittle": the right magic tuning numbers to get a speedup depend on the details of the hardware, and maybe system load. Too early and it's evicted before the demand load. Too late and it doesn't help. This blog article shows code + graphs for an interesting experiment in using SW prefetch on Haswell for prefetching the non-sequential part of a problem. See also How to properly use prefetch instructions?. NT prefetch is interesting, but even more brittle because an early eviction from L1 means you have to go all the way to L3 or DRAM, not just L2. If you need every last drop of performance, and you can tune for a specific machine, SW prefetch is worth looking at for sequential access, but it may still be a slowdown if you have enough ALU work to do while coming close to bottlenecking on memory.


Cache line size is still 64 bytes. (L1D read/write bandwidth is very high, and modern CPUs can do 2 vector loads per clock + 1 vector store if it all hits in L1D. See How can cache be that fast?.) With AVX512, line size = vector width, so you can load/store an entire cache line in one instruction. Thus every misaligned load/store crosses a cache-line boundary, instead of every other for 256b AVX1/AVX2, which often doesn't slow down looping over an array that wasn't in L1D.

Unaligned load instructions have zero penalty if the address is aligned at runtime, but compilers (especially gcc) make better code when autovectorizing if they know about any alignment guarantees. Actually unaligned ops are generally fast, but page-splits still hurt (much less on Skylake, though; only ~11 extra cycles latency vs. 100, but still a throughput penalty).
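As a small illustration (hypothetical function; __builtin_assume_aligned is a GNU extension, and C++20 has std::assume_aligned for the same purpose), promising 64-byte alignment to the compiler looks like this:

#include <cstddef>

// Telling gcc/clang the pointer is 64-byte aligned lets the autovectorizer
// use aligned loads/stores and drop the scalar peel/prologue loop.
void scale(float *p, std::size_t n, float f)
{
    float *pa = static_cast<float *>(__builtin_assume_aligned(p, 64));
    for (std::size_t i = 0; i < n; ++i)
        pa[i] *= f;
}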


As Ulrich predicted, every multi-socket system is NUMA these days: integrated memory controllers are standard, i.e. there is no external Northbridge. But SMP no longer means multi-socket, because multi-core CPUs are widespread. Intel CPUs from Nehalem to Skylake have used a large inclusive L3 cache as a backstop for coherency between cores. AMD CPUs are different, but I'm not as clear on the details.

Skylake-X (AVX512) no longer has an inclusive L3, but I think there's still a tag directory that lets it check what's cached anywhere on chip (and if so where) without actually broadcasting snoops to all the cores. SKX uses a mesh rather than a ring bus, with generally even worse latency than previous many-core Xeons, unfortunately.

Basically all of the advice about optimizing memory placement still applies, just the details of exactly what happens when you can't avoid cache misses or contention vary.


6.1 Bypassing the Cache - SSE4.1 movntdqa (_mm_stream_load_si128) NT loads only ever do anything on WC memory regions. On normal memory you get from malloc/new or mmap (WB memory attribute = write-back cacheable), movntdqa runs the same as a normal SIMD load, not bypassing cache. But it costs an extra ALU uop. AFAIK, this was true even on CPUs at the time the article was written, making this a rare mistake in the guide. Unlike NT stores, NT loads do not override the usual memory ordering rules for the region. And they have to respect coherency, so they can't fully skip cache on WB-cacheable regions: the data needs to be somewhere other cores can invalidate on write. But SSE4.1 wasn't introduced until 2nd-gen Core 2, so there weren't single-core CPUs with it.
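A tiny example (hypothetical wrapper), mainly to show what not to expect from it on ordinary memory:

#include <smmintrin.h>   // SSE4.1: _mm_stream_load_si128

// On normal write-back (WB) memory from malloc/new, this is just a 16-byte
// load plus an extra ALU uop; it does not bypass the cache.
__m128i nt_load(void *p)
{
    return _mm_stream_load_si128((__m128i *)p);
}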

NT prefetch (prefetchnta) can minimize cache pollution, but does still fill L1d cache, and one way of L3 on Intel CPUs with an inclusive L3 cache. But it's brittle and hard to tune: too short a prefetch distance and you get demand loads, which probably defeat the NT aspect; too long and your data is evicted before use. And since it wasn't in L2, and maybe not even L3, a miss might go all the way to DRAM. Since the right prefetch distance depends on the system and on the workload from other code, not just your own code, this is a problem.
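A hedged sketch of the usual pattern (PF_DIST is a made-up tuning constant; whether this helps at all has to be measured on each machine):

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_NTA
#include <cstddef>

// Sum data we will only read once, NT-prefetching a fixed distance ahead to
// limit cache pollution. Too small a distance and the demand load wins the
// race; too large and the line is evicted before we get to it.
float sum_once(const float *a, std::size_t n)
{
    constexpr std::size_t PF_DIST = 64;   // elements ahead (tune per machine)
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PF_DIST < n)
            _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_NTA);
        s += a[i];
    }
    return s;
}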

Related:

6.4.2 Atomic ops: the benchmark showing a CAS-retry loop as 4x worse than hardware-arbitrated lock add does probably still reflect a maximum contention case. But in real multi-threaded programs, synchronization is kept to a minimum (because it's expensive), so contention is low and a CAS-retry loop usually succeeds without having to retry.

C++11 std::atomic fetch_add will compile to a lock add (or lock xadd if the return value is used), but an algorithm using CAS to do something that can't be done with a locked instruction is usually not a disaster. Use C++11 std::atomic or C11 stdatomic instead of gcc legacy __sync built-ins or the newer __atomic built-ins unless you want to mix atomic and non-atomic access to the same location...
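As a minimal sketch (not from the paper): the plain counter below compiles to lock add (or lock xadd if the old value is used), while the saturating variant needs a CAS-retry loop, which under low contention normally succeeds on the first attempt:

#include <atomic>

std::atomic<long> counter{0};

void hit()
{
    counter.fetch_add(1, std::memory_order_relaxed);   // lock add / lock xadd
}

// Saturating increment: something a single locked instruction can't express,
// so it uses a compare_exchange retry loop.
void hit_saturating(std::atomic<long> &c, long max)
{
    long old = c.load(std::memory_order_relaxed);
    do {
        if (old == max)
            return;                                    // already at the cap
    } while (!c.compare_exchange_weak(old, old + 1, std::memory_order_relaxed));
}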

8.1 DWCAS (cmpxchg16b): You can coax gcc into emitting it, but if you want efficient loads of just one half of the object, you need ugly union hacks: How can I implement ABA counter with c++11 CAS?. (Don't confuse DWCAS with DCAS of 2 separate memory locations. Lock-free atomic emulation of DCAS isn't possible with DWCAS, but transactional memory (like x86 TSX) makes it possible.)
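A minimal sketch of a 16-byte pointer+counter CAS (PtrTag is a hypothetical type; whether the compiler inlines lock cmpxchg16b or calls libatomic depends on the compiler version and -mcx16):

#include <atomic>
#include <cstdint>

// Pointer plus ABA counter, packed into 16 bytes and CASed as one unit.
struct alignas(16) PtrTag {
    void *ptr;
    std::uintptr_t tag;
};

std::atomic<PtrTag> head{PtrTag{nullptr, 0}};

bool replace_head(void *new_ptr)
{
    PtrTag old = head.load(std::memory_order_acquire);
    PtrTag desired{new_ptr, old.tag + 1};              // bump the ABA counter
    return head.compare_exchange_strong(old, desired,
                                        std::memory_order_acq_rel,
                                        std::memory_order_acquire);
}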

8.2.4 transactional memory: After a couple false starts (released then disabled by a microcode update because of a rarely-triggered bug), Intel has working transactional memory in late-model Broadwell and all Skylake CPUs. The design is still what David Kanter described for Haswell. There's a lock-elision way to use it to speed up code that uses (and can fall back to) a regular lock (especially with a single lock for all elements of a container so multiple threads in the same critical section often don't collide), or to write code that knows about transactions directly.

Update: and now Intel has disabled lock-elision on later CPUs (including Skylake) with a microcode update. The RTM (xbegin / xend) non-transparent part of TSX can still work if the OS allows it, but TSX in general is seriously turning into Charlie Brown's football.
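A rough sketch of the RTM path (hypothetical linked-list push; needs -mrtm and a CPU/OS combination where TSX is still enabled, and a production-quality version would also read the fallback lock inside the transaction and abort if it's held):

#include <immintrin.h>   // _xbegin, _xend, _XBEGIN_STARTED
#include <mutex>

struct Node { Node *next; };
Node *list_head = nullptr;
std::mutex fallback;       // taken only if the transaction aborts

void push(Node *n)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        n->next = list_head;        // plain stores, made atomic by the transaction
        list_head = n;
        _xend();                    // commit
    } else {
        std::lock_guard<std::mutex> g(fallback);   // conflict/capacity abort: take the real lock
        n->next = list_head;
        list_head = n;
    }
}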


7.5 Hugepages: anonymous transparent hugepages work well on Linux without having to manually use hugetlbfs. Make allocations >= 2MiB with 2MiB alignment (e.g. posix_memalign, or an aligned_alloc that doesn't enforce the stupid ISO C++17 requirement to fail when size % alignment != 0).

A 2MiB-aligned anonymous allocation will use hugepages by default. Some workloads (e.g. that keep using large allocations for a while after making them) may benefit from
echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag to get the kernel to defrag physical memory whenever needed, instead of falling back to 4k pages. (See the kernel docs). Use madvise(MADV_HUGEPAGE) after making large allocations (preferably still with 2MiB alignment) to more strongly encourage the kernel to stop and defrag now. defrag = always is too aggressive for most workloads and will spend more time copying pages around than it saves in TLB misses. (kcompactd could maybe be more efficient.)
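A small sketch of that recipe (Linux-specific; alloc_huge is a made-up helper, and rounding the size up to whole 2MiB pages is an assumption of the sketch):

#include <stdlib.h>      // posix_memalign
#include <sys/mman.h>    // madvise, MADV_HUGEPAGE
#include <cstddef>

// 2MiB-align the allocation and hint that we'd like transparent hugepages.
void *alloc_huge(std::size_t bytes)
{
    constexpr std::size_t kHuge = 2 * 1024 * 1024;
    std::size_t rounded = (bytes + kHuge - 1) / kHuge * kHuge;   // whole 2MiB pages
    void *p = nullptr;
    if (posix_memalign(&p, kHuge, rounded) != 0)
        return nullptr;
    madvise(p, rounded, MADV_HUGEPAGE);   // best-effort hint; safe to ignore failure
    return p;
}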

BTW, Intel and AMD call 2M pages "large pages", with "huge" only used for 1G pages. Linux uses "hugepage" for everything larger than the standard size.

(32-bit mode legacy (non-PAE) page tables only had 4M pages as the next largest size, with only 2-level page tables with more compact entries. The next size up would have been 4G, but that's the whole address space, and that "level" of translation is the CR3 control register, not a page directory entry. IDK if that's related to Linux's terminology.)


Appendix B: Oprofile: Linux perf has mostly superseded oprofile. perf list / perf stat -e event1,event2 ... has names for most of the useful ways to program HW performance counters.

perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,\
branches,branch-misses,instructions,uops_issued.any,\
uops_executed.thread,idq_uops_not_delivered.core -r2 ./a.out

A few years ago, the ocperf.py wrapper was needed to translate event names into codes, but these days perf has that functionality built-in.

For some examples of using it, see Can x86's MOV really be "free"? Why can't I reproduce this at all?.

Farmland answered 8/12, 2017 at 12:32 Comment(16)
Very instructive answer and pointers! This clearly deserves more votes!Shotputter
@Peter Cordes Are there any other guides/papers you'd recommend reading? I'm not a high performance programmer but I'd like to learn more about it and hopefully pick up practices which I can incorporate into my daily programming.Harrelson
@user3927312: agner.org/optimize is one of the best and most coherent guides to low-level stuff for x86 specifically, but some of the general ideas apply to other ISAs. As well as asm guides, Agner has an optimizing C++ PDF. For other performance / CPU-architecture links, see stackoverflow.com/tags/x86/info. I've also written some about optimizing C++ by helping the compiler make better asm for critical loops, when it's worth having a look at the compiler's asm output: C++ code for testing the Collatz conjecture faster than hand-written asm?Farmland
Minor nitpick - the 2 MiB pages aren't "huge" pages on 80x86. For 80x86, page sizes are 4 KiB (plain/normal), large (2 MiB or 4 MiB) and huge (1 GiB); and this terminology (plain/normal, large and huge) is shared by everything (Windows, Linux, etc). AMD also started doing a "coalesce four physically and linearly consecutive pages into a single TLB entry" thing, creating a kind of "pseudo-16 KiB page" size (medium pages? I don't know).Jeanicejeanie
@Brendan: I'm using the same terminology as the Linux kernel, which refers to anything above the minimum as a "hugepage", as in "transparent hugepages", with hugepage sizes of 2M or 1G on x86-64. kernel.org/doc/Documentation/vm/transhuge.txt, and especially kernel.org/doc/Documentation/vm/hugetlbpage.txt where Linux makes its usage clear via examples. Is the "large page" term widely used in other circles? (Interesting point about hardware transparent TLB coalescing or whatever AMD calls it, I guess that means the page-walker checks 4x8 bytes of aligned PTEs.)Farmland
@PeterCordes: "large pages" are what Intel and AMD have always called 2 MiB (and 4 MiB) pages. Windows also calls them large pages (e.g. MEM_LARGE_PAGES flag for VirtualAlloc()). Linux seems to support one or the other but not both at the same, and uses the same word for either case. Note that it's relatively shocking how crippled operating systems are (Windows not supporting 1 GiB pages at all, requiring special permission just to use 2 MiB pages, not allowing 2 MiB pages to be "pageable"; and Linux having a cesspool of hackery with 2 separate systems and no way for user-space to choose)Jeanicejeanie
@Brendan: If you use Linux's clunky hugetlbfs (not possible for transparent hugepages), you can use mmap with MAP_HUGE_2MB or MAP_HUGE_1GB. But otherwise totally agreed, it's pretty bad how much it sucks to get hugepages most of the time, e.g. never for file-backed mappings. But at least with madvise(MADV_HUGEPAGE) you can often get the perf you want, with the right background vs. foreground defrag settings. It's somewhat understandable, given the desire to keep fast-paths fast, but you'd think it would be better sorted out by now. It's not like mlock that needs strict limits.Farmland
@PeterCordes: I think the central problem (that causes all the other restrictions and limitations) is that Linux lacks the ability to split large/huge pages up into multiple small/large pages and combine multiple small/large pages back into a single large/huge page again. The former (splitting) is easy. The latter (combining) takes special memory management plus some "active scavenging" (to find & replace the few remaining pages needed to merge into a whole larger page when most of the pages are free but some aren't; and to prevent reaching an "everything fragmented" state).Jeanicejeanie
@Brendan: Linux certainly can combine multiple small pages into a large page; see kernel.org/doc/Documentation/vm/transhuge.txt. Active scavenging (by defragging) is what khugepaged does, unless you disable it with echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag. There are some other tune settings to control when an mmap allocation and/or madvise waits for defragging vs. starting with small pages and working in the background. (echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag). If you didn't know about this, Linux is less bad than you think!Farmland
@PeterCordes: I mean for free physical pages; not for allocated virtual pages. If Linux supported splitting/merging of free physical pages, the total number of large pages would change often based on demand; and it wouldn't make sense to have "fixed number of large pages" (e.g. set by /proc/sys/vm/nr_hugepages). Instead; at boot you'd always start with the maximum number of free large pages that are possible and then you'd split them whenever you run out of free small pages (and if there's lots of free small pages you'd attempt to merge them back into free physical large pages).Jeanicejeanie
@PeterCordes: Note that this would get rid of all the admin hassle, make it easier to support large pages for things like memory mapped files and swap space (as the physical page could just be split if the backing store doesn't support large pages), make it much more able to adjust to demand (no more "large pages are reserved and cannot be used for other purposes" silliness); and when there are multiple page sizes the benefits are multiplied (e.g. free 1 GiB page can be split into 512 free 2 MiB pages, that can be split into 128 free 64 KiB pages, that can be split into four 4KiB pages).Jeanicejeanie
@Brendan: The transparent-hugepage mechanism does adapt to demand for anonymous pages; khugepaged defrags physical memory by copying small pages. I think it can split again under memory pressure to page out part of what was a large page. You probably know this, but for future readers your comments make it sound like Linux could barely use large pages at all. In practice many workloads have a lot of anonymous pages. But that's a separate mechanism from the non-transparent way to alloc large pages, or from being able to use them for file-backed mappings (doing large page-ins?)Farmland
@PeterCordes:P No. As far as I can tell; the transparent-hugepage mechanism works purely with virtual memory (e.g. replacing 512 4 KiB virtual pages with one 2 MiB virtual page, by taking a free 2 MiB physical page from a fixed size pool, copying data and mapping the 2 MiB page, then freeing the original 512 4 KiB physical pages) and never does anything with physical pages (e.g. never combines 512 free 4 KiB physical pages into a 2 MiB free physical page). The defrag done by khugepaged is also purely virtual (e.g. "defrag = defer" to pull data from swap can't make sense for physical pages).Jeanicejeanie
@Brendan: My desktop with 16GiB of total physical RAM has /proc/meminfo showing AnonHugePages: 1058816 kB and HugePages_Total: 0 (matching /proc/sys/vm/nr_hugepages). I've had significantly more total hugepages when I had fewer browser tabs open and have run a test that allocates a big region, uses madvise on it, and loops over it touching it. (with defer+madvise or with always defrag settings). Re-running the same test a few times in a row increases the amount of hugepages by doing even more defragging.Farmland
@Brendan: Oops, I checked the docs again: defer mentions activating kswapd and kcompactd. So they do the defragging, making THPs available for khugepaged to install. I hope that includes copying/moving physical pages from mem to mem and updating PTEs that ref them, not literally paging them out and back in to a different phys page. (I hope that's why kcompactd exists separate from kswapd). But anyway, what you describe just sounds like coalescing virtual pages, not anything worthy of the name "defrag". Without garbage collection, you can't defrag virtual address space (user or kernel)Farmland
@PeterCordes: Ah, you're right (kcompactd is the clue I was missing). It's actually worse/more awful than I suspected - like the name suggests, it literally does "compaction" (moving everything from one end of a physical memory zone to another) rather than only bothering when most small pages belonging to the larger page are already free.Jeanicejeanie

From my quick glance-through it looks quite accurate. The one thing to note is the portion on the difference between "integrated" and "external" memory controllers. Ever since the release of the i7 line, Intel CPUs have all used integrated memory controllers, and AMD has been using integrated memory controllers since the AMD64 chips were first released.

Since this article was written, not a whole lot has changed: speeds have gotten higher, and the memory controllers have gotten much more intelligent (the i7 will delay writes to RAM until it feels like committing the changes). But nothing has changed in any way that a software developer would care about.

Lou answered 14/11, 2011 at 18:40 Comment(1)
Probably the most major change that's relevant for SW developers is that prefetch threads are a bad idea. CPUs are powerful enough to run 2 full threads with hyperthreading, and have much better HW prefetch. SW prefetch in general is a lot less important, especially for sequential access. See my answer.Farmland
