Why is iterating through `std::vector` faster than iterating through `std::array`?

I recently asked this question: Why is iterating an std::array much faster than iterating an std::vector?

As people quickly pointed out, my benchmark had many flaws. So as I was trying to fix my benchmark, I noticed that std::vector wasn't slower than std::array and, in fact, it was quite the opposite.

#include <vector>
#include <array>
#include <stdio.h>
#include <chrono>

using namespace std;

constexpr int n = 100'000'000;
vector<int> v(n);
//array<int, n> v;

int main()
{
    int res = 0;
    auto start = chrono::steady_clock::now();
    for(int x : v)
        res += x;
    auto end = chrono::steady_clock::now();
    auto diff = end - start;
    double elapsed =
        std::chrono::duration_cast<
            std::chrono::duration<double, std::milli>
        >(end - start).count();
    printf("result: %d\ntime: %f\n", res, elapsed);
}

Things I've tried to improve from my previous benchmark:

  • Made sure I'm using the result, so the whole loop is not optimized away
  • Using -O3 flag for speed
  • Use std::chrono instead of the time command. That's so we can isolate the part we want to measure (just the for loop). Static initialization of variables and things like that won't be measured.

The measured times:

array:

$ g++ arrVsVec.cpp -O3
$ ./a.out
result: 0
time: 99.554109

vector:

$ g++ arrVsVec.cpp -O3
$ ./a.out
result: 0
time: 30.734491

I'm just wondering what I'm doing wrong this time.

See the disassembly on Godbolt.

Oreste answered 20/7, 2019 at 13:37 Comment(17)
It is very likely that dynamic allocation had the result of "priming" the memory pages that comprise the storage of the vector. The array is "stored" in the uninitialized memory, that gets demand paged-in, so iterating over the array generates page-faults that require the OS to go through the motions of allocating and clearing the underlying memory pages. Create a second loop, right after the first one, and then time the second loop instead of the first one. Better yet: move the first loop into a separate function in a different translation unit, and call it before the second loop.Deprive
@SamVarshavchik That's very interesting. I will try thatOreste
@SamVarshavchik Indeed, the second loop was 10 times faster for std::array. There was no improvement for std::vectorOreste
Now move the first loop into an external function, that gets compiled separately. The compiler might still be up to some optimization tricks.Deprive
You might try comparing with a heap-allocated std::array as in auto pv = std::make_unique<std::array<int, n>>(); auto& v = *pv;Zena
@SamVarshavchik Amazing, now both take the same time. Thank you very much :)Oreste
@Zena Yep, they have the same performance :)Oreste
The differences you've run into (before compensating for them) demonstrate how amazing compilers optimization is these days.Fairlie
@SamVarshavchik: dynamic allocation is still lazy, what's going on is that std::vector initializes the memory by writing 0 with the default constructor for int. If you'd used new int[n] to just dynamically allocate without touching the memory, you'd still get soft page faults and TLB misses. (Assuming you still do that at global scope, so the compiler can't see that it's definitely reading uninitialized memory: UB it could optimize away.)Pigskin
@Fairlie Sometimes. But if you think about it, it was optimizing something for the array case but not for the vector case. And just changing things a bit was enough to turn off those optimizationsOreste
@PeterCordes Thanks, just tried it and it seems to be trueOreste
@Fairlie huh? No, the std::vector version is just doing a bunch of work before main runs, allocating and writing all the memory to do all the page faults outside the timed region. This isn't compiler optimization, it's just timing error. If you timed the entire program, the std::vector version would probably be slower because it's first writing and then reading, instead of just reading from the BSS. (You get page faults that CoW map those BSS pages to the same physical zero page, so you get TLB misses + page faults but mostly data-cache hits.)Pigskin
partial duplicate: Can static memory be lazily allocated?. I'm sure I've seen or written an answer to this question before, still looking for a more detailed duplicate about lazy BSS allocation leading to page-faults inside the timed region.Pigskin
The array is not zero-initialized, so maybe the addition in the loop can't be optimized awayPhenazine
@Phenazine array is zero-initialized, see en.cppreference.com/w/cpp/language/…Ashliashlie
@MaximEgorushkin: but code-gen for main doesn't know that unless maybe you compile with gcc -fwhole-program. It's a non-const global so anything else could have modified it after static initialization, before main or inside any non-inline function call like chrono::steady_clock::now. (GCC doesn't have a builtin for now() so it just has to treat it like any non-inline function that might modify any global.) Anyway, these things are true for both the vector and the array. godbolt.org/z/JT9u6B shows that printing the sum stops it from optimizing away like in OP's prev Q.Pigskin
I'm not a mind reader, but I think the OP's question was not specific to this particular case where array and vector have static storage (not really true for vector). @PeterCordes's comment explains clearly that a bunch of things happen before the part of the code we are timing. This highlights the fact that we should always perform a dry run before actually timing, or at least time many runs in order to amortise such peripheral costs.Girdle

The difference is due to memory pages of array not being resident in process address space (global scope array is stored in .bss section of the executable that hasn't been paged in, it is zero-initialized). Whereas vector has just been allocated and zero-filled, so its memory pages are already present.

If you add

std::fill_n(v.data(), n, 1); // included in <algorithm>

as the first line of main to bring the pages in (pre-fault), that makes array time the same as that of vector.


Alternatively, on Linux you can use mlock(v.data(), v.size() * sizeof(v[0])); to bring the pages into the address space. See man mlock for full details.

Ashliashlie answered 20/7, 2019 at 14:25 Comment(18)
Or you could make array possibly faster than vector by only reading it with a warm-up loop instead of writing it, so the pages can stay CoW mapped to the same physical zero page. Then you can get L1d or L2 hits while looping over the huge array, and only some possible TLB misses depending on the microarchitecture.Pigskin
@PeterCordes After the write loop the pages are resident and its cached parts are valid. I do not see how reading instead of writing would change things, since the timing is done after that loop.Ashliashlie
related: Global variable seems to not occupy any memory space and Can static memory be lazily allocated?. Especially Why is the second loop over a static array in the BSS faster than the first? is specifically about a large static array, and the first loop eating all the page faults as a warm-up for the 2nd loop.Pigskin
Why is adding two std::vectors slower than raw arrays from new[]? is very good, this Q is a possible duplicate. (Except for the non-obvious fact that untouched BSS storage is basically like untouched memory from new)Pigskin
writing forces allocation of separate physical pages for each virtual page (CoW). Reading can allow all the virtual pages of the array to still share the same physical page. The logical array size is larger than L1d, L2, or even L3 cache sizes (100M * 4 bytes) so they won't still be hot after an init loop over the whole array. Data has to come from DRAM.Pigskin
@PeterCordes This is .bss section, it is not backed by pages in the executable image, so its pages are always CoW, aren't they? Oh, you probably mean that it uses the same read-only zero page for the entire .bss section. I see now, said a blind man.Ashliashlie
Yes, exactly. My understanding / mental-model is that there is a single physical 4k and/or 2M zero page that Linux uses for every known-zero-backed anonymous page across all processes / tasks. Any mapping of it is of course read-only. A read fault on an anonymous page just sets up this read-only CoW mapping, instead of preparing for a write.Pigskin
@PeterCordes On the other hand, IIRC, in AMD CPU documentation it talks about cache line tagging and specifically mentions that when the same hardware page is mapped multiple times in the same address space, that may result in a different cache line tag, hence resulting in cache miss, although the page is in cache.Ashliashlie
Bulldozer L1d cache is VIPT (virtually-indexed / physically-tagged), according to realworldtech.com/bulldozer/8. 16kiB / 4-way is small enough that it doesn't need anything special to avoid aliasing problems, Like Intel's and Ryzens's 32kiB / 8-way. I haven't checked AMD's optimization manual, but perhaps what you're remembering is for AMD K10 (Barcelona/Istanbul) which used a 64kiB / 2-way associative L1d cache and would need special tricks to avoid aliasing if it wants to be VIPT (to allow indexing in parallel with TLB). IDK, it's possible that BD or Zen does something.Pigskin
On my Skylake, I see a significant number of L2 hits when looping repeatedly over the unwritten array. godbolt.org/z/SxI2_j (I checked the asm to make sure my tricks stopped the repeat-loop from optimizing away). I used taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,mem_load_retired.l2_hit:u,mem_load_retired.l2_miss:u,mem_load_retired.l1_hit:u -r4 ./touchpage-array-readinit-O3-native. I see only 808 page faults, so going read-only also encouraged more faultaround, or hugepages?Pigskin
@PeterCordes AMD Software Optimization Guide for Zen CPU: Linear aliasing occurs when two different linear addresses are mapped to the same physical address. This can cause performance penalties for loads and stores to the aliased cachelines. A load to an address that is valid in the L1 DC but under a different linear alias will see an L1 DC miss, which requires an L2 cache request to be made. The latency will generally be no larger than that of an L2 cache hit.Ashliashlie
Interesting, thanks. IDK why they do that. I think on Intel my L2 cache hits were only due to prefetch from L3. I think Linux might have been using a 2M transparent hugepage for the unwritten BSS because I still get some L2 misses but no L3 misses. Read-init makes the read loop about 2.8x faster than write-init where the read loop does get some L3 misses, and more TLB misses. Also more page faults. I can see from /proc/PID/smaps that there were no anon hugepages used for the BSS after write init.Pigskin
@PeterCordes - it's an implementation tradeoff. It's not easy to make a fast, low latency, low power VIPT L1 cache, so AMD uses something called "microtagging". This lets you do the L1 lookup with the V address only, moving the TLB out of the critical path. The downside is that virtual aliases are not resolved, and so you you can have only one V alias of a given P address in the cache at once.Tolan
Note for example that AMD has 4-cycle latency in many more scenarios than Intel, so that is probably one positive outcome of the design.Tolan
@BeeOnRope: Interesting, thanks for the info on AMD. I just tried madvise(..., MADV_NOHUGEPAGE) on my Skylake, and confirmed that Intel's VIPT cache really can give all L1d hits. (But at the cost of more TLB misses). godbolt.org/z/91cpyE run with no args for read-init, or with argc>1 for write-init. I get basically all mem loads being mem_load_retired.l1_hit:u for read-init. So this confirms that Intel's design is definitely not like AMD's.Pigskin
Yes, Intel's recent designs are all pure VIPT AFAIK.Tolan
@Tolan IMO, the key part is linear-address-based microtag (utag). The utag is a hash of the load's linear address (V).Ashliashlie
@MaximEgorushkin: seg:off to linear happens before virt(linear) to physical translation. So yes, if you're building a cache with microtagging, you'd use linear virtual addresses, not segment:offset. Non-zero segment bases (FS or GS) can pay the extra latency for an adder before starting the normal process.Pigskin

Memory mapping/allocating is lazy: the first access to a page will cause a page fault exception (#PF on x86). This includes the BSS, as well as file-backed mappings like the text segment of your executable. These page faults are "valid" so they don't result in a SIGSEGV being delivered; instead the kernel allocates a physical page if necessary and wires up the hardware page tables so the load or store can rerun and not fault the 2nd time.

This is expensive, especially if the kernel doesn't "fault-around" and prepare multiple pages during one page fault. (Especially with Spectre + Meltdown mitigation enabled making user<->kernel round trips more expensive on current x86-64 hardware.)

You're letting std::vector's constructor write zeros to the array after dynamic allocation¹. std::vector does all the page-faulting outside your timed loop. This happens before main, while the implementation is running constructors for static objects.

But the array is zero-initialized so it gets placed in the BSS. The first thing to touch it is your loop. Your array<> loop pays for all the page faults inside the timed region.

If you used new int[n] to dynamically allocate but not initialize a block of memory, you'd see the same behaviour as from your static array<>. (Maybe slightly better if Linux is more willing to use transparent hugepages for a dynamic allocation instead of the BSS mapping.)



Footnote 1: std::vector in libstdc++ and libc++ is too stupid to take advantage of getting already-zeroed pages from the OS, like it could if it used calloc or equivalent. It would be possible if the library provided a new/delete-compatible allocator for zeroed memory.

C++ new/delete is crippled vs. malloc/free/calloc/realloc. I have no idea why ISO C++ left out calloc and realloc: both are very useful for large allocations, especially realloc for resizing a std::vector of trivially-copyable objects that might have room to grow its mapping without copying. But since new/delete aren't guaranteed to be compatible with malloc/free, and new is replaceable, libraries can't very easily use calloc and realloc even under the hood.


Another factor: read-only leaves pages CoW mapped to the same physical zero page

When lazy allocation is triggered by a read (instead of write), it reads as zero. (BSS pages read as zero, new pages from mmap(MAP_ANONYMOUS) read as all-zero.)

The (soft) page fault handler that wired up the HW page table didn't need to actually allocate a physical page aka page-frame to back that virtual page. Instead, Linux maps clean (unwritten) anonymous pages to a single physical zeroed page. (This applies across all tasks.)

If we make multiple passes over the array, this leads to the curious situation where we can get TLB misses but L1d or L3 hits (depending on hugepage or not) because we have multiple virtual pages pointing to the same physical location.

(Some CPUs, e.g. AMD Ryzen, use micro-tagging in the L1d cache to save power, at the cost of the cache only being able to hit for one virtual address even if the same memory is mapped to multiple virtual addresses. Intel CPUs use true VIPT L1d caches and really can get this effect.)

I made a test program for Linux that will use madvise(MADV_HUGEPAGE) (to encourage the kernel to defrag memory for hugepages) or madvise(MADV_NOHUGEPAGE) (to disable hugepages even for the read-only case).

For some reason Linux BSS pages don't use hugepages when you write them. Only reading them does use 2M hugepages (too big for L1d or L2, but does fit in L3. But we do get all TLB hits). It's hard to see this in /proc/PID/smaps because unwritten memory doesn't show up as "resident" at all. (Remember it's physically backed by a system-wide shared region of zeroes).

I made some changes to your benchmark code to rerun the sum loop multiple times after an init pass that either reads or writes the array, according to command-line args. The repeat-loop makes it run longer so we can get more precise timing, and to amortize the init so we get useful results from perf.

#include <vector>
#include <array>
#include <stdio.h>
#include <chrono>
#include <sys/mman.h>

using namespace std;

constexpr int n = 100'000'000;
//vector<int> v(n);
alignas(4096) array<int, n> v;

//template<class T>
__attribute__((noinline))
int toucharray(volatile int *vv, int write_init) {
        int res=vv[0];
        for(int i=32 ; i<n ; i+=128)
                if(write_init)
                    vv[i] = 0;
                else
                    res += vv[i];
//      volatile int sum = res;  // noinline is fine, we don't need to stop multiple calls from CSEing
        return res;
}

template <class T>
__attribute__((noinline,noclone))
int sum_container(T &vv) {
    unsigned int res=0;
    for(int x : vv)
        res += x;
    __attribute__((used)) static volatile int sink;
    sink = res;  // a side-effect stops IPA from deciding that this is a pure function
    return res;
}

int main(int argc, char**argv)
{
    int write_init = 0;
    int hugepage = 0;
    if (argc>1) {
            hugepage = argv[1][0] & 1;
            write_init = argv[1][0] & 2;
    }
    int repcount = 1000;
    if (argc>2)
            repcount = atoi(argv[2]);

// TODO: option for no madvise.
    madvise(v.data(), n*sizeof(v[0]), MADV_SEQUENTIAL);
    madvise(v.data(), n*sizeof(v[0]), hugepage ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);  
    madvise(v.data(), n*sizeof(v[0]), MADV_WILLNEED); 
 // SEQ and WILLNEED probably only matter for file-backed mappings to reduce hard page faults.
 //  Probably not encouraging faultahead / around for lazy-allocation soft page fault

    toucharray(v.data(), write_init);

    int res = 0;
    auto start = chrono::steady_clock::now();
    for(int i=0; i<repcount ; i++)
        res = sum_container(v);
    auto end = chrono::steady_clock::now();
    double elapsed =
        std::chrono::duration_cast<
            std::chrono::duration<double, std::milli>
        >(end - start).count();
    printf("result: %d\ntime: %f\n", res, elapsed);
}

best case: clang++ -O3 -march=native (skylake) actually unrolls with multiple accumulators, unlike gcc -funroll-loops which does a silly job.

On my Skylake i7-6700k with DDR4-2666 DRAM, configured for 4.2GHz max turbo and governor=performance:

# using std::array<int,n>
# 0&1 = 0 -> MADV_NOHUGEPAGE.  0&2 = 0 -> read-only init
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 0 1000
result: 0
time: 1961.952394

 Performance counter stats for './touchpage-array-madv-nohuge-argc.clang 0 1000':

          2,017.34 msec task-clock:u              #    1.000 CPUs utilized          
                50      context-switches          #    0.025 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
            97,774      page-faults               #    0.048 M/sec                  
     8,287,680,837      cycles                    #    4.108 GHz                    
    14,500,762,859      instructions              #    1.75  insn per cycle         
            13,688      mem_load_retired.l2_hit:u #    0.007 M/sec                  
    12,501,329,912      mem_load_retired.l1_hit:u # 6196.927 M/sec                  
           144,559      mem_inst_retired.stlb_miss_loads:u #    0.072 M/sec                  

       2.017765632 seconds time elapsed

       1.979410000 seconds user
       0.036659000 seconds sys

Notice considerable TLB misses (mem_inst_retired.stlb_miss_loads:u counts 2nd-level TLB misses in user-space). And 97k page faults. That's pretty much exactly as many 4k pages as it takes to cover the 100M * 4 = 400MB array, so we got 1 fault per page with no pre-fault / fault-around.

Fortunately Skylake has two page-walk units so it can be doing two speculative page-walks in parallel. Also, all the data access is hitting in L1d so page-tables will stay hot in at least L2, speeding up page walks.

# using array
# MADV_HUGEPAGE,  read-only init
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 1 1000
result: 0
time: 5947.741408

 Performance counter stats for './touchpage-array-argc.clang 1 1000':

          5,951.40 msec task-clock:u              #    1.000 CPUs utilized          
                 9      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               687      page-faults               #    0.115 K/sec                  
    24,377,094,416      cycles                    #    4.096 GHz                    
    14,397,054,228      instructions              #    0.59  insn per cycle         
     2,183,878,846      mem_load_retired.l2_hit:u #  366.952 M/sec                  
       313,684,419      mem_load_retired.l1_hit:u #   52.708 M/sec                  
            13,218      mem_inst_retired.stlb_miss_loads:u #    0.002 M/sec                  

       5.951530513 seconds time elapsed

       5.944087000 seconds user
       0.003284000 seconds sys

Notice ~1/10th the TLB misses, but that out of the same ~12G mem loads, only 2G of them hit in L2, probably thanks to successful HW prefetch. (The rest did hit in L3 though.) And that we only had 687 page faults; a combination of faultaround and hugepages made this much more efficient.

And note that the time taken is 3x higher because of the bottleneck on L3 bandwidth.


Write-init of the array gives us the worst of both worlds:

# using array
# MADV_HUGEPAGE (no apparent effect on BSS)  and write-init

taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 3 1000
result: 0
time: 16510.222762

 Performance counter stats for './touchpage-array-argc.clang 3 1000':

         17,143.35 msec task-clock:u              #    1.000 CPUs utilized          
               341      context-switches          #    0.020 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
            95,218      page-faults               #    0.006 M/sec                  
    70,475,978,274      cycles                    #    4.111 GHz                    
    17,989,948,598      instructions              #    0.26  insn per cycle         
       634,015,284      mem_load_retired.l2_hit:u #   36.983 M/sec                  
       107,041,744      mem_load_retired.l1_hit:u #    6.244 M/sec                  
        37,715,860      mem_inst_retired.stlb_miss_loads:u #    2.200 M/sec                  

      17.147615898 seconds time elapsed

      16.494211000 seconds user
       0.625193000 seconds sys

Lots of page faults. Also far more TLB misses.

std::vector version is basically the same as array:

strace shows that madvise didn't work because I didn't align the pointer. glibc / libstdc++ new tends to return a pointer that is page-aligned + 16, with allocator bookkeeping in those first 16 bytes. For the array, I used alignas(4096) to make sure I could pass it to madvise.

madvise(0x7f760d133010, 400000000, MADV_HUGEPAGE) = -1 EINVAL (Invalid argument)

So anyway, with my kernel tuning settings, it only tries to defrag memory for hugepages on madvise, and memory is pretty fragmented ATM. So it didn't end up using any hugepages.

taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-vector-argv.clang 3 1000
result: 0
time: 16020.821517

 Performance counter stats for './touchpage-vector-argv.clang 3 1000':

         16,159.19 msec task-clock:u              #    1.000 CPUs utilized          
                17      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
            97,771      page-faults               #    0.006 M/sec                  
    66,146,780,261      cycles                    #    4.093 GHz                    
    15,294,999,994      instructions              #    0.23  insn per cycle         
       217,426,277      mem_load_retired.l2_hit:u #   13.455 M/sec                  
       842,878,166      mem_load_retired.l1_hit:u #   52.161 M/sec                  
         1,788,935      mem_inst_retired.stlb_miss_loads:u #    0.111 M/sec                  

      16.160982779 seconds time elapsed

      16.017206000 seconds user
       0.119618000 seconds sys

I'm not sure why the TLB miss count is so much higher than for the THP read-only test. Maybe contention for memory access and/or eviction of cached page tables by touching more memory ends up slowing down pagewalks so TLB prefetch doesn't keep up.

Out of the ~12G loads, HW prefetching was able to make about 1G of them hit in L1d or L2 cache.

Pigskin answered 21/7, 2019 at 6:40 Comment(0)
