Memory mapping/allocating is lazy: the first access to a page will cause a page fault exception (#PF
on x86). This includes the BSS, as well as file-backed mappings like the text segment of your executable. These page faults are "valid" so they don't result in a SIGSEGV being delivered; instead the kernel allocates a physical page if necessary and wires up the hardware page tables so the load or store can rerun and not fault the 2nd time.
This is expensive, especially if the kernel doesn't "fault-around" and prepare multiple pages during one page fault. (Especially with Spectre + Meltdown mitigation enabled making user<->kernel round trips more expensive on current x86-64 hardware.)
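To see this cost directly, here is a minimal sketch (my illustration, not part of the original answer) that counts the soft (minor) page faults taken while first-touching a large, lazily-allocated buffer via getrusage's ru_minflt counter; the buffer size is chosen to match the ~400MB array discussed below, and each 4k page should cost roughly one minor fault on first touch:

#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical illustration, not part of the original answer.
static long minor_faults() {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main() {
    const size_t bytes = 400u * 1000 * 1000;        // ~400MB, like 100M ints
    volatile char *buf = (char*)calloc(bytes, 1);   // big calloc: fresh zeroed pages, allocated lazily
    if (!buf) return 1;
    long before = minor_faults();
    for (size_t i = 0; i < bytes; i += 4096)
        buf[i];                                     // first touch: roughly one soft fault per 4k page
    printf("soft faults for first touch: %ld\n", minor_faults() - before);
    free((void*)buf);
}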
You're letting std::vector's constructor write zeros to the array after dynamic allocation (footnote 1). std::vector does all the page-faulting outside your timed loop; it happens before main, while the implementation is running constructors for static objects. But the array<> is zero-initialized, so it gets placed in the BSS, and the first thing to touch it is your loop: your array<> loop pays for all the page faults inside the timed region.
If you used new int[n] to dynamically allocate but not initialize a block of memory, you'd see the same behaviour as from your static array<>. (Maybe slightly better if Linux is more willing to use transparent hugepages for a dynamic allocation instead of the BSS mapping.)
Footnote 1: std::vector in libstdc++ and libc++ is too stupid to take advantage of getting already-zeroed pages from the OS, like it could if it used calloc or equivalent. It would be possible if the library provided a new/delete-compatible allocator for zeroed memory.
C++ new/delete is crippled vs. malloc/free/calloc/realloc. I have no idea why ISO C++ left out calloc and realloc: both are very useful for large allocations, especially realloc for resizing a std::vector of trivially-copyable objects that might have room to grow its mapping without copying. But since new/delete aren't guaranteed to be compatible with malloc/free, and new is replaceable, libraries can't very easily use calloc and realloc even under the hood.
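As a sketch of what such an allocator could look like (my illustration, not something libstdc++ or libc++ actually provide), here is a minimal C++ allocator that hands std::vector calloc'ed memory. Note the caveat in the comments: without library cooperation, vector's value-initialization still writes the zeros itself, which is exactly the missed optimization this footnote complains about.

#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical sketch, not from the original answer: an allocator that gets
// already-zeroed memory from calloc instead of operator new.
template <class T>
struct calloc_allocator {
    using value_type = T;
    calloc_allocator() = default;
    template <class U> calloc_allocator(const calloc_allocator<U>&) {}

    T* allocate(std::size_t n) {
        void *p = std::calloc(n, sizeof(T));   // large allocations: fresh zeroed pages from the OS
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T *p, std::size_t) noexcept { std::free(p); }
};
template <class T, class U>
bool operator==(const calloc_allocator<T>&, const calloc_allocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const calloc_allocator<T>&, const calloc_allocator<U>&) { return false; }

// usage: std::vector<int, calloc_allocator<int>> v(n);
// Caveat: vector<int> v(n) still value-initializes every element through the allocator,
// i.e. it writes the zeros again; the library would have to know the memory is already
// zeroed (or the allocator would have to provide a no-op construct()) to benefit.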
Another factor: read-only access leaves pages CoW-mapped to the same physical zero page
When lazy allocation is triggered by a read (instead of a write), it reads as zero. (BSS pages read as zero, and new pages from mmap(MAP_ANONYMOUS) read as all-zero.)
The (soft) page fault handler that wired up the HW page table didn't need to actually allocate a physical page aka page-frame to back that virtual page. Instead, Linux maps clean (unwritten) anonymous pages to a single physical zeroed page. (This applies across all tasks.)
If we make multiple passes over the array, this leads to the curious situation where we can get TLB misses but L1d or L3 hits (depending on hugepage or not) because we have multiple virtual pages pointing to the same physical location.
(Some CPUs, e.g. AMD Ryzen, use micro-tagging in the L1d cache to save power, at the cost of the cache only being able to hit for one virtual address even if the same memory is mapped to multiple virtual addresses. Intel CPUs use true VIPT L1d caches and really can get this effect.)
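To illustrate the shared zero page (my sketch, not from the original answer), this small Linux program maps anonymous memory and only reads it; every touched page gets wired to the kernel's zero page, so the region stays essentially non-resident until something writes to it:

#include <sys/mman.h>
#include <stdio.h>

// Hypothetical demo, not from the original answer: read-only touching of anonymous
// memory maps every page to the kernel's shared zero page (no private page-frames).
int main() {
    const size_t len = 1UL << 30;          // 1 GiB of lazily-allocated anonymous memory
    char *raw = (char*)mmap(nullptr, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) { perror("mmap"); return 1; }
    volatile char *p = raw;

    unsigned long sum = 0;
    for (size_t i = 0; i < len; i += 4096)
        sum += p[i];                       // soft fault per page, but all wired to the zero page
    printf("sum = %lu (all zeros)\n", sum);
    // Inspecting /proc/self/smaps here should show the region as barely resident;
    // writing p[i] = 1 instead would trigger real copy-on-write allocation per page.
}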
I made a test program for Linux that will use madvise(MADV_HUGEPAGE) (to encourage the kernel to defrag memory for hugepages) or madvise(MADV_NOHUGEPAGE) (to disable hugepages even for the read-only case).
For some reason, Linux BSS pages don't use hugepages when you write them; only reading them uses 2M hugepages (too big for L1d or L2, but they do fit in L3, and we get all TLB hits). It's hard to see this in /proc/PID/smaps because unwritten memory doesn't show up as "resident" at all. (Remember, it's physically backed by a system-wide shared region of zeroes.)
I made some changes to your benchmark code to rerun the sum loop multiple times, after an init pass that either reads or writes the array according to command-line args. The repeat loop makes the run longer so we get more precise timing, and amortizes the init so we get useful results from perf.
#include <vector>
#include <array>
#include <stdio.h>
#include <stdlib.h>     // atoi
#include <chrono>
#include <sys/mman.h>
using namespace std;
constexpr int n = 100'000'000;
//vector<int> v(n);
alignas(4096) array<int, n> v;
//template<class T>
__attribute__((noinline))
int toucharray(volatile int *vv, int write_init) {
    int res = vv[0];
    for (int i=32 ; i<n ; i+=128)
        if (write_init)
            vv[i] = 0;
        else
            res += vv[i];
    // volatile int sum = res;  // noinline is fine, we don't need to stop multiple calls from CSEing
    return res;
}
template <class T>
__attribute__((noinline,noclone))
int sum_container(T &vv) {
    unsigned int res = 0;
    for (int x : vv)
        res += x;
    __attribute__((used)) static volatile int sink;
    sink = res;  // a side-effect stops IPA from deciding that this is a pure function
    return res;
}
int main(int argc, char **argv)
{
    int write_init = 0;
    int hugepage = 0;
    if (argc > 1) {
        hugepage = argv[1][0] & 1;
        write_init = argv[1][0] & 2;
    }
    int repcount = 1000;
    if (argc > 2)
        repcount = atoi(argv[2]);

    // TODO: option for no madvise.
    madvise(v.data(), n*sizeof(v[0]), MADV_SEQUENTIAL);
    madvise(v.data(), n*sizeof(v[0]), hugepage ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);
    madvise(v.data(), n*sizeof(v[0]), MADV_WILLNEED);
    // SEQ and WILLNEED probably only matter for file-backed mappings to reduce hard page faults.
    // Probably not encouraging fault-ahead / fault-around for lazy-allocation soft page faults.

    toucharray(v.data(), write_init);

    int res = 0;
    auto start = chrono::steady_clock::now();
    for (int i = 0; i < repcount; i++)
        res = sum_container(v);
    auto end = chrono::steady_clock::now();

    double elapsed = std::chrono::duration_cast<
                         std::chrono::duration<double, std::milli>
                     >(end - start).count();
    printf("result: %d\ntime: %f\n", res, elapsed);
}
Best case: clang++ -O3 -march=native (Skylake) actually unrolls with multiple accumulators, unlike gcc -funroll-loops, which does a silly job of it.
On my Skylake i7-6700k with DDR4-2666 DRAM, configured for 4.2GHz max turbo and governor=performance -
# using std::array<int,n>
# 0&1 = 0 -> MADV_NOHUGEPAGE. 0&2 = 0 -> read-only init
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 0 1000
result: 0
time: 1961.952394
Performance counter stats for './touchpage-array-madv-nohuge-argc.clang 0 1000':
2,017.34 msec task-clock:u # 1.000 CPUs utilized
50 context-switches # 0.025 K/sec
0 cpu-migrations # 0.000 K/sec
97,774 page-faults # 0.048 M/sec
8,287,680,837 cycles # 4.108 GHz
14,500,762,859 instructions # 1.75 insn per cycle
13,688 mem_load_retired.l2_hit:u # 0.007 M/sec
12,501,329,912 mem_load_retired.l1_hit:u # 6196.927 M/sec
144,559 mem_inst_retired.stlb_miss_loads:u # 0.072 M/sec
2.017765632 seconds time elapsed
1.979410000 seconds user
0.036659000 seconds sys
Notice the considerable TLB misses (mem_inst_retired.stlb_miss_loads:u counts 2nd-level TLB misses in user-space), and 97k page faults. That's pretty much exactly as many 4k pages as it takes to cover the 100M * 4 = 400MB array (400 MB / 4 KiB ≈ 97,656 pages), so we got 1 fault per page with no pre-fault / fault-around.
Fortunately Skylake has two page-walk units so it can be doing two speculative page-walks in parallel. Also, all the data access is hitting in L1d so page-tables will stay hot in at least L2, speeding up page walks.
# using array
# MADV_HUGEPAGE, read-only init
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 1 1000
result: 0
time: 5947.741408
Performance counter stats for './touchpage-array-argc.clang 1 1000':
5,951.40 msec task-clock:u # 1.000 CPUs utilized
9 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
687 page-faults # 0.115 K/sec
24,377,094,416 cycles # 4.096 GHz
14,397,054,228 instructions # 0.59 insn per cycle
2,183,878,846 mem_load_retired.l2_hit:u # 366.952 M/sec
313,684,419 mem_load_retired.l1_hit:u # 52.708 M/sec
13,218 mem_inst_retired.stlb_miss_loads:u # 0.002 M/sec
5.951530513 seconds time elapsed
5.944087000 seconds user
0.003284000 seconds sys
Notice ~1/10th the TLB misses, but that out of the same ~12G mem loads, only 2G of them hit in L2, probably thanks to successful HW prefetch. (The rest did hit in L3 though.) And that we only had 687 page faults; a combination of faultaround and hugepages made this much more efficient.
And note that the time taken is 3x higher because of the bottleneck on L3 bandwidth.
Write-init of the array gives us the worst of both worlds:
# using array
# MADV_HUGEPAGE (no apparent effect on BSS) and write-init
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 3 1000
result: 0
time: 16510.222762
Performance counter stats for './touchpage-array-argc.clang 3 1000':
17,143.35 msec task-clock:u # 1.000 CPUs utilized
341 context-switches # 0.020 K/sec
0 cpu-migrations # 0.000 K/sec
95,218 page-faults # 0.006 M/sec
70,475,978,274 cycles # 4.111 GHz
17,989,948,598 instructions # 0.26 insn per cycle
634,015,284 mem_load_retired.l2_hit:u # 36.983 M/sec
107,041,744 mem_load_retired.l1_hit:u # 6.244 M/sec
37,715,860 mem_inst_retired.stlb_miss_loads:u # 2.200 M/sec
17.147615898 seconds time elapsed
16.494211000 seconds user
0.625193000 seconds sys
Lots of page faults. Also far more TLB misses.
The std::vector version is basically the same as the array:
strace shows that madvise didn't work because I didn't align the pointer: glibc / libstdc++ new tends to return a pointer that is page-aligned + 16, with allocator bookkeeping in the first 16 bytes. For the array, I used alignas(4096) to make sure I could pass it to madvise.
madvise(0x7f760d133010, 400000000, MADV_HUGEPAGE) = -1 EINVAL (Invalid argument)
So anyway, with my kernel tuning settings, it only tries to defrag memory for hugepages on madvise, and memory is pretty fragmented ATM. So it didn't end up using any hugepages.
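One workaround (my sketch, not something the answer's code does) is to round the madvise range inward to page boundaries, so a new-allocated vector buffer that starts at page_start + 16 can still be advised:

#include <stdint.h>
#include <sys/mman.h>

// Hypothetical helper, not from the original answer: madvise only the fully-contained
// pages of an arbitrary (possibly unaligned) heap allocation.
static int madvise_inner_pages(void *p, size_t len, int advice) {
    const uintptr_t pagesize = 4096;                                  // assume 4k base pages
    uintptr_t start = (uintptr_t)p;
    uintptr_t aligned = (start + pagesize - 1) & ~(pagesize - 1);     // round start up
    uintptr_t slack = aligned - start;
    if (len <= slack) return 0;                                       // no whole page inside the range
    size_t aligned_len = (len - slack) & ~(uintptr_t)(pagesize - 1);  // round length down
    return madvise((void*)aligned, aligned_len, advice);
}

// usage with the std::vector version:  madvise_inner_pages(v.data(), n*sizeof(v[0]), MADV_HUGEPAGE);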
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-vector-argv.clang 3 1000
result: 0
time: 16020.821517
Performance counter stats for './touchpage-vector-argv.clang 3 1000':
16,159.19 msec task-clock:u # 1.000 CPUs utilized
17 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
97,771 page-faults # 0.006 M/sec
66,146,780,261 cycles # 4.093 GHz
15,294,999,994 instructions # 0.23 insn per cycle
217,426,277 mem_load_retired.l2_hit:u # 13.455 M/sec
842,878,166 mem_load_retired.l1_hit:u # 52.161 M/sec
1,788,935 mem_inst_retired.stlb_miss_loads:u # 0.111 M/sec
16.160982779 seconds time elapsed
16.017206000 seconds user
0.119618000 seconds sys
I'm not sure why TLB misses are so much higher than for the THP read-only test. Maybe contention for memory access and/or eviction of cached page tables by touching more memory ends up slowing down pagewalks, so TLB prefetch doesn't keep up.
Out of the ~12G loads, HW prefetching was able to make about 1G of them hit in L1d or L2 cache.
Comments:

… std::array. There was no improvement for std::vector. – Oreste

… std::array as in auto pv = std::make_unique<std::array<int, n>>(); auto& v = *pv; – Zena

… 0 with the default constructor for int. If you'd used new int[n] to just dynamically allocate without touching the memory, you'd still get soft page faults and TLB misses. (Assuming you still do that at global scope, so the compiler can't see that it's definitely reading uninitialized memory: UB it could optimize away.) – Pigskin

… array case but not for the vector case. And just changing things a bit was enough to turn off those optimizations. – Oreste

… main runs, allocating and writing all the memory to do all the page faults outside the timed region. This isn't compiler optimization, it's just timing error. If you timed the entire program, the std::vector version would probably be slower because it's first writing and then reading, instead of just reading from the BSS. (You get page faults that CoW-map those BSS pages to the same physical zero page, so you get TLB misses + page faults but mostly data-cache hits.) – Pigskin

… array is zero-initialized, see en.cppreference.com/w/cpp/language/… – Ashliashlie

… main doesn't know that unless maybe you compile with gcc -fwhole-program. It's a non-const global, so anything else could have modified it after static initialization, before main, or inside any non-inline function call like chrono::steady_clock::now. (GCC doesn't have a builtin for now(), so it just has to treat it like any non-inline function that might modify any global.) Anyway, these things are true for both the vector and the array. godbolt.org/z/JT9u6B shows that printing the sum stops it from optimizing away like in OP's prev Q. – Pigskin

… array and vector have static storage (not really true for vector). @PeterCordes's comments explain clearly that a bunch of things happen before the part of the code we are timing. This highlights the fact that we should always perform a dry run before actually timing, or at least time many runs in order to amortise such peripheral costs. – Girdle