Why is std::fill(0) slower than std::fill(1)?

I have observed on a system that std::fill on a large std::vector<int> was significantly and consistently slower when setting a constant value 0 compared to a constant value 1 or a dynamic value:

5.8 GiB/s vs 7.5 GiB/s

However, the results are different for smaller data sizes, where fill(0) is faster:

performance for single thread at different data sizes

With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):

performance for various thread counts at large data size

This raises a secondary question: why is the peak bandwidth of fill(1) so much lower?

The test system for this was a dual socket Intel Xeon CPU E5-2680 v3 set at 2.5 GHz (via /sys/cpufreq) with 8x16 GiB DDR4-2133. I tested with GCC 6.1.0 (-O3) and Intel compiler 17.0.1 (-fast); both give identical results. GOMP_CPU_AFFINITY=0,12,1,13,2,14,3,15,4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 was set. Stream/add with 24 threads gets 85 GiB/s on the system.

I was able to reproduce this effect on a different Haswell dual socket server system, but not on any other architecture. For example, on Sandy Bridge-EP, memory performance is identical, while in cache fill(0) is much faster.

Here is the code to reproduce:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>

using value = int;
using vector = std::vector<value>;

constexpr size_t write_size = 8ll * 1024 * 1024 * 1024;
constexpr size_t max_data_size = 4ll * 1024 * 1024 * 1024;

void __attribute__((noinline)) fill0(vector& v) {
    std::fill(v.begin(), v.end(), 0);
}

void __attribute__((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}

void bench(size_t data_size, int nthreads) {
#pragma omp parallel num_threads(nthreads)
    {
        vector v(data_size / (sizeof(value) * nthreads));
        auto repeat = write_size / data_size;
#pragma omp barrier
        auto t0 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill0(v);
#pragma omp barrier
        auto t1 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill1(v);
#pragma omp barrier
        auto t2 = omp_get_wtime();
#pragma omp master
        std::cout << data_size << ", " << nthreads << ", " << write_size / (t1 - t0) << ", "
                  << write_size / (t2 - t1) << "\n";
    }
}

int main(int argc, const char* argv[]) {
    std::cout << "size,nthreads,fill0,fill1\n";
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, 1);
    }
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, omp_get_max_threads());
    }
    for (int nthreads = 1; nthreads <= omp_get_max_threads(); nthreads++) {
        bench(max_data_size, nthreads);
    }
}

The presented results were compiled with g++ fillbench.cpp -O3 -o fillbench_gcc -fopenmp.

Underwear answered 2/3, 2017 at 15:4 Comment(12)
What is the data size when you are comparing the number of threads?Nichani
@GavinPortwood 4 GiB, so in memory, not cache.Underwear
Then there must be something wrong with the second plot, the weak-scaling one. I can't imagine it would take more than two or so threads to saturate memory bandwidth for a loop with minimal intermediate operations. In fact, you haven't identified the thread count where the bandwidth saturates, even at 24 threads. Can you show that it does level out at some finite thread count?Nichani
@GavinPortwood On this system it is in accordance with other benchmark numbers that the bandwidth is saturated at ~7 of 12 cores for one socket. See for example the stream numbers, where there is a factor of ~5 between single core and all cores. What I cannot easily explain is the behavior of the second socket (13-24 threads). I would have expected a similar slope and saturation as for the first socket (1-12 threads). I assume this has something to do with asymmetric thread distribution.Underwear
@GavinPortwood I reran the experiments with different affinity settings (spreading across the two sockets) and updated the picture. You see the saturation better. But the main pattern remains: fill(1) has a higher slope but a much lower maximum bandwidth than fill(0).Underwear
I suspect the anomalous scaling in your original experiment (on the second socket) is related to non-homogeneous memory allocation and the resulting QPI communication. That can be verified with Intel's "uncore" PMUs (I think)Nichani
I am slowly starting to look into your question https://mcmap.net/q/14526/-enhanced-rep-movsb-for-memcpy/2542702Gerhard
FWIW - you found the code difference in your answer and I think Peter Cordes has the answer below: that rep stosb is using a non-RFO protocol which halves the number of transactions needed to do a fill. The rest of the behavior mostly falls out of that. There is one other disadvantage the fill(1) code has: it can't use 256-bit AVX stores because you aren't specifying -march=haswell or whatever, so it has to fall back to 128-bit code. fill(0), which calls memset, gets the advantage of libc dispatching to the AVX version on your platform.Wilcher
You could try with the -march argument at compile to do somewhat more of an apples-to-apples comparison: this will mostly help for small buffers that fit in some level of the cache, but not the larger copies.Wilcher
@Wilcher -march=native gives a vmovdq loop, which only seems to increase L1 performance, though not quite to the level of rep stos.Underwear
Right - but was it using ymm or xmm regs? That's the key difference (256-bit vs 128-bit). I guess your results make sense - I think the L2 has a bandwidth of 32 bytes/cycle, which would seem to need 32 byte stores (at the max of 1 per cycle) to saturate it, but without NT stores the bandwidth is split in half between the actual stores and the RFO requests, so 16 bytes of reads is "enough" to saturate even L2 (same logic applies for L3, more or less). L1, on the other hand, can sustain 32 bytes of writes per cycle, so 256-bit is a win there.Wilcher
That was ymm, I added the results to my answer, also including intrinsic non-temporal.Underwear

From your question + the compiler-generated asm from your answer:

  • fill(0) is an ERMSB rep stosb which will use 256b stores in an optimized microcoded loop. (Works best if the buffer is aligned, probably to at least 32B or maybe 64B).
  • fill(1) is a simple 128-bit movaps vector store loop. Only one store can execute per core clock cycle regardless of width, up to 256b AVX. So 128b stores can only fill half of Haswell's L1D cache write bandwidth. This is why fill(0) is about 2x as fast for buffers up to ~32kiB. Compile with -march=haswell or -march=native to fix that (see the example command after this list).

    Haswell can just barely keep up with the loop overhead, but it can still run 1 store per clock even though it's not unrolled at all. But with 4 fused-domain uops per clock, that's a lot of filler taking up space in the out-of-order window. Some unrolling would maybe let TLB misses start resolving farther ahead of where stores are happening, since there is more throughput for store-address uops than for store-data. Unrolling might help make up the rest of the difference between ERMSB and this vector loop for buffers that fit in L1D. (A comment on the question says that -march=native only helped fill(1) for L1.)
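For reference, rebuilding with wider stores only needs the architecture flag added to the compile command given in the question, e.g. g++ fillbench.cpp -O3 -march=native -fopenmp -o fillbench_native (the output file name is just illustrative).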

Note that rep movsd (which could be used to implement fill(1) for int elements) will probably perform the same as rep stosb on Haswell. Although the official documentation only guarantees that ERMSB gives fast rep stosb (but not rep stosd), actual CPUs that support ERMSB use similarly efficient microcode for rep stosd. There is some doubt about IvyBridge, where maybe only b is fast. See @BeeOnRope's excellent ERMSB answer for updates on this.

gcc has some x86 tuning options for string ops (like -mstringop-strategy=alg and -mmemset-strategy=strategy), but IDK if any of them will get it to actually emit rep movsd for fill(1). Probably not, since I assume the code starts out as a loop, rather than a memset.
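For comparison's sake, a rep stosd fill can also be written by hand with GNU extended inline asm. This is only an illustrative sketch (x86-64, GCC/Clang asm syntax, direction flag assumed clear); neither compiler emits anything like it for std::fill here, and the function name is made up:

#include <cstddef>

// Fill `count` int elements at `dst` with `value` using rep stosd (AT&T: stosl).
// The instruction stores EAX to [RDI], RCX times, advancing RDI by 4 each step.
void fill_rep_stosd(int* dst, int value, std::size_t count) {
    asm volatile("rep stosl"
                 : "+D"(dst), "+c"(count)  // RDI and RCX are consumed/updated
                 : "a"(value)              // EAX holds the 32-bit fill value
                 : "memory");
}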


With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):

A normal movaps store to a cold cache line triggers a Read For Ownership (RFO). A lot of real DRAM bandwidth is spent on reading cache lines from memory when movaps writes the first 16 bytes. ERMSB uses a no-RFO protocol for its stores, so the memory controllers are only writing. (Except for miscellaneous reads, like page tables if any page-walks miss even in L3 cache, and maybe some load misses in interrupt handlers or whatever).
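A back-of-the-envelope check against the numbers from the question: with RFO, every stored line also costs a line read, so 51 GiB/s of useful fill(1) stores corresponds to roughly 2 x 51 = 102 GiB/s of DRAM traffic - in the same ballpark as the ~90 GiB/s of pure writes that the no-RFO fill(0) reaches.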

@BeeOnRope explains in comments that the difference between regular RFO stores and the RFO-avoiding protocol used by ERMSB has downsides for some ranges of buffer sizes on server CPUs where there's high latency in the uncore/L3 cache. See also the linked ERMSB answer for more about RFO vs non-RFO, and the high latency of the uncore (L3/memory) in many-core Intel CPUs being a problem for single-core bandwidth.


movntps (_mm_stream_ps()) stores are weakly-ordered, so they can bypass the cache and go straight to memory a whole cache-line at a time without ever reading the cache line into L1D. movntps avoids RFOs, like rep stos does. (rep stos stores can reorder with each other, but not outside the boundaries of the instruction.)

Your movntps results in your updated answer are surprising.
For a single thread with large buffers, your results are movnt >> regular RFO > ERMSB. So that's really weird that the two non-RFO methods are on opposite sides of the plain old stores, and that ERMSB is so far from optimal. I don't currently have an explanation for that. (edits welcome with an explanation + good evidence).

As we expected, movnt allows multiple threads to achieve high aggregate store bandwidth, like ERMSB. movnt always goes straight into line-fill buffers and then memory, so it is much slower for buffer sizes that fit in cache. One 128b vector per clock is enough to easily saturate a single core's no-RFO bandwidth to DRAM. Probably vmovntps ymm (256b) is only a measurable advantage over vmovntps xmm (128b) when storing the results of a CPU-bound AVX 256b-vectorized computation (i.e. only when it saves the trouble of unpacking to 128b).

movnti bandwidth is low because storing in 4B chunks bottlenecks on 1 store uop per clock adding data to the line fill buffers, not on sending those full line-fill buffers to DRAM (until you have enough threads to saturate memory bandwidth).
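A rough sanity check using the 2.5 GHz clock from the question: one 4-byte movnti per cycle puts only about 4 bytes x 2.5 GHz ≈ 10 GB/s per core into the line-fill buffers, so the 32-bit variant is limited by store-uop throughput rather than by DRAM until several threads are running - consistent with the observation that it only lags for low thread counts.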


@osgx posted some interesting links in comments.

See also other stuff in the tag wiki.

Langille answered 10/7, 2017 at 17:59 Comment(27)
Although rep movsd isn't officially covered by the ermsb feature, all recent Intel CPUs (and apparently Ryzen) seem to implement it using the same protocol and it ends up generally having indistinguishable performance. Still, there is little reason to use it, since rep movsb pretty much offers a superset of the functionality and who knows how Intel and AMD will optimize them in the future, but in the meantime at least existing code that has rep movsd effectively gets the benefit of ermsb.Wilcher
The behavior described above of rep movsb versus an explicit loop of movaps on a single core across various buffer sizes is pretty consistent with what we have seen before on server cores. As you point out, the competition is between a non-RFO protocol and the RFO protocol. The former uses less bandwidth between all cache levels, but especially on server chips has a long latency handoff all the way to memory. Since a single core is generally concurrency limited, the latency matters, and the non-RFO protocol wins, which is what you see in the region beyond the 30 MB L3.Wilcher
... in the middle of the graph that fits in L3, however, the long server uncore to memory handoff apparently doesn't come into play, so the read reduction offered by non-RFO wins (but actually it's interesting to compare this to NT stores: would they show the same behavior, or is rep stosb able to stop the write at L3 rather than go all the way to memory)? FWIW, the situation for rep stosb for fill is relatively better, empirically, than for rep movsb for memcpy. Possibly because the former has a 2:1 advantage in traffic versus 3:2 for the latter.Wilcher
Some links to measurement on the topic in this answer under the "Latency Bound Platforms" heading. It is talking about movsb not stosb, but the same general pattern applies.Wilcher
This answer is most excellent, and @Wilcher finally clarifies the anomaly for me. I saw your excellent answer before, but now I feel I understood it :).Underwear
I tried movntps and if I'm using it correctly, it shows memory bandwidth across all data sizes - so it doesn't benefit from caches at all. But for a single thread, that is twice the memory bandwidth of movaps, and for 24 threads it's slightly higher than rep stosb.Underwear
@Underwear - ok that is a very interesting result for movntps. It makes sense: movntps is saying "force this write all the way to memory" which means you will generally get the same behavior even for smaller sizes. rep movsb on the other hand is going to be size-aware, so will only switch into non-RFO protocol at some threshold. A real world implementation of memset or fill would also likely switch over to NT only after some threshold (often "50% of the L3 cache size" or something like that).Wilcher
@BeeOnRope: Can rep stos avoid RFO without force-evicting lines from cache, or bypassing the cache? Those are two separate things, so couldn't there be a non-RFO protocol that leaves data in cache?Langille
@Zulan: just to confirm, you used 128b SSE2 or AVX _mm_stream_ps ([v]movntps [mem], xmm), not AVX 256b _mm256_stream_ps (vmovntps [mem], ymm), right?Langille
@PeterCordes actually, there's no performance difference between 128/256. Please see the update to my answer for detailed results.Underwear
@Zulan: That makes sense, since MOVNT is always going straight into line-fill buffers and then memory. And 128b vectors are enough to saturate that easily. I guess the only time vmovntps ymm would be an advantage is when storing the results of a 256b-vectorized computation that was CPU bound (but would be memory-bound if you didn't use NT stores). Unpacking to 128b stores would take extra shuffles, so obviously you just want to use 256b NT stores if your data is already in 256b vectors.Langille
@PeterCordes did you ever get an answer to your question to @BeeOnRope? "Can rep stos avoid RFO without force-evicting lines from cache, or bypassing the cache? Those are two separate things, so couldn't there be a non-RFO protocol that leaves data in cache?" Also do you know if rep stosb is ever implemented with non-temporal stores?Abert
@Wilcher (or anyone else since BeeOnRope seems inactive) why does rep movsb have a longer latency handoff on server chips? The post you linked seems to indicate that the increase/decrease in latency is due to handoff time from LFB to memory (dma device?) which is dependent on whether the LFB cache line is in LLC or L2. Since rep movsb prefetches (and that post indicates it prefetches better than a movaps loop) wouldn't the handoff latency be low or equal to movaps loop?Abert
@Noah: rep movsb can do no-RFO stores (since P6), like movntps except it doesn't force eviction from cache. Since IvB (ERMSB), they're even weakly-ordered. It's still different so we can't say for sure, but NT stores also have a similar slower handoff probably for some similar internal reason. (Which I don't particularly understand.)Langille
@PeterCordes what do you mean "doesn't force eviction from cache"? Do you mean a memcpy with rep movsb will leave the cache in the state it was in before the memcpy? That it doesn't invalidate lines on other cores? Or that it will go through cache on the core doing the memcpy (i.e. only invalidating other lines rather than loading them fully with an RFO)? Or something different altogether? It seems it prefetches (so it goes through cache in some way, unlike an NT store), but I am having trouble building a mental model of what it does internally.Abert
@PeterCordes I am considering making a post asking about the internals of rep movsb to try and understand all the information in this post, @BeeOnRope's comments, and @BeeOnRope's other post that they linked. Or am I missing something obvious and just misreading?Abert
@Noah: No, of course it doesn't break coherence; it invalidates instead of RFOing before doing a full-line store, since a full-line store makes it pointless to have read the old contents of the line, saving bandwidth. An NT store guarantees that the line won't be present in any caches, like movaps + clflush (but diff perf). After rep movsb, (part of) the destination can still be hot in this core's caches when you're done, unlike a movnt store loop. That's part of what @Bee's ERMSB answer explains, isn't it?Langille
@PeterCordes I see. Will it actively load the destinations into cache or does it just leave what was there intact? I.e. for a memcpy on hot src/dst with rep movsb, src and dst will both stay in the core's caches. Will a memcpy on cold src/dst with rep movsb not load either of them, or will it load both (pretty sure the latter, just want to verify as rep movsb seems different from both movaps and movnt)? As a side note, does vmovdqa zmm also bypass the RFO and just invalidate, or is rep movsb special?Abert
@Noah: it should be obvious that after any store, the cache line will definitely not still be hot in some other core's private cache. There's no shared bus for a core to broadcast the new data on (instead it's directory-based coherence with L3 tags or similar structure as the directory). The storing core needs exclusive ownership before updating its own L1d, by invalidating other copies, and has to wait for an acknowledgement of the invalidation. It has to maintain coherence if 2 cores try to rep movsb to the same destination at once.Langille
@Noah: Re: full-line ZMM stores avoiding an RFO: good question, I don't know but it's 100% possible. Internally it could work exactly like a full-line store from rep stos / rep movs. It's something I've wondered, but I forget if I ever found an answer, or what it was for different microarchitectures. (It's an optimization that can of course be added to a later design if SKX or KNL didn't have it.) There might be some reason it's only worth it for a long stream of stores, like somehow taking longer to do something, maybe delaying later stores and stalling the store buffer.Langille
@Noah: Forgot to mention: rep movs / rep stos might even be adaptive in strategy with some large-size cutoff, like maybe using actual NT stores that bypass cache for very large stores. Or doing something simpler for small copies that only touch a couple lines. But more microcode branching could increase startup overhead so they wouldn't do that without good reason, but it's possible and something to keep in mind if trying to figure out what they do with experiments with perf counters and small to medium copies.Langille
@PeterCordes re: "it should be obvious that after any store, the cache line will definitely not still be hot in some other core's private cache" I meant in the core doing the memcpy. But your next point, that it's unknown whether the microcode will branch for NT stores, indicates that the effect on the cache of the core doing the memcpy is unknown/dependent on rcx. I'll post a question about the zmm case (or any case where the store buffer could know a full cache line is being overwritten).Abert
@Noah: Oh, I think I misread your first comment. AFAIK, rep movsb doesn't use cache-bypassing loads (because that would not be coherent, and there'd be nowhere to prefetch into; the OoO exec window isn't big enough to hide the full load latency from DRAM). So the microcoded loads are basically just normal loads like vmovups. Possibly with something like prefetchnta for some prefetch distance to reduce pollution from loads, but I wouldn't bet on it. So after rep movsb, you can expect the end of the source data to be hot in cache, too.Langille
Microcode can only use uops that the back-end supports, and those uops have to go through the pipeline normally. There isn't (unfortunately) a dedicated memcpy state machine (like a page walker) that rep movsb could offload to (a decision that Andy Glew regretted after the fact). I've been hoping to hear details of Ice Lake's "fast short rep movs" support, whether it's just better microcode or dedicated hardware state-machine that can access cache in parallel with the normal load/store units.Langille
@PeterCordes hmm I am totally unable to get fewer RFO requests using rep movsb. With vmovntdq I see fewer than with vmovdqa but I reliably see more with rep movsb than either of the other two. I don't think the issue is ICL as the change appears to only be for short copies. Made a godbolt link with my benchmark. Any idea what I'm messing up?Abert
@Noah: IDK, seems strange. Ask a new question.Langille
@PeterCordes question if you have any ideas :P . Also in doing these tests I didn't see any reduction in RFO requests when using zmm registers (i.e temporal store with zmm get same number of RFO as temporal stores with ymm / xmm and likewise for non-temporal stores)Abert

I'll share my preliminary findings, in the hope of encouraging more detailed answers. I just felt this would be too much as part of the question itself.

The compiler optimizes fill(0) to an internal memset. It cannot do the same for fill(1), since memset only works on bytes.

Specifically, both glibc's __memset_avx2 and the Intel compiler's __intel_avx_rep_memset are implemented with a single hot instruction:

rep    stos %al,%es:(%rdi)
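Conceptually (not how the microcode actually issues its wide stores internally), rep stosb behaves like the following byte loop, which is why it can only express a memset-style fill where every byte of the pattern is the same:

#include <cstddef>

// Conceptual C++ model of rep stosb with the direction flag clear:
// store the byte in AL to [RDI], RCX times, advancing RDI.
void rep_stosb_model(unsigned char* rdi, unsigned char al, std::size_t rcx) {
    while (rcx--) *rdi++ = al;
}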

Whereas the manual loop compiles down to an actual loop around a 128-bit store instruction:

add    $0x1,%rax                                                                                                       
add    $0x10,%rdx                                                                                                      
movaps %xmm0,-0x10(%rdx)                                                                                               
cmp    %rax,%r8                                                                                                        
ja     400f41

Interestingly, while there is a template/header optimization to implement std::fill via memset for byte types, in this case it is a compiler optimization that transforms the actual loop. Strangely, for a std::vector<char>, gcc also optimizes fill(1). The Intel compiler does not, despite the memset template specialization.
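A minimal sketch of that byte-element case (the function name is illustrative): because a char element is a single byte, any fill value can legally be expressed as a memset, and, as observed above, gcc then optimizes fill(1) this way too while the Intel compiler does not:

#include <algorithm>
#include <vector>

// With char elements, gcc recognizes this as memset(v.data(), 1, v.size()).
void __attribute__((noinline)) fill1_char(std::vector<char>& v) {
    std::fill(v.begin(), v.end(), char(1));
}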

Since the performance difference occurs only when the code is actually working in memory rather than in cache, it appears that the Haswell-EP architecture fails to efficiently consolidate the single-byte writes.

I would appreciate any further insight into the issue and the related micro-architecture details. In particular it is unclear to me why this behaves so differently for four or more threads and why memset is so much faster in cache.

Update:

Here are the results in comparison with:

  • fill(1) using -march=native (AVX2 vmovdq %ymm0) - it works better in L1, but is similar to the movaps %xmm0 version for the other memory levels.
  • Variants of 32, 128 and 256 bit non-temporal stores. They perform consistently, with the same bandwidth regardless of data size. All outperform the other variants in memory, especially for small numbers of threads. 128-bit and 256-bit perform almost identically; for low numbers of threads, 32-bit performs significantly worse.

For <= 6 threads, vmovnt has a 2x advantage over rep stos when operating in memory.

Single threaded bandwidth:

single threaded performance by data size

Aggregate bandwidth in memory:

memory performance by thread count

Here is the code used for the additional tests with their respective hot-loops:

void __attribute__ ((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}
┌─→add    $0x1,%rax
│  vmovdq %ymm0,(%rdx)
│  add    $0x20,%rdx
│  cmp    %rdi,%rax
└──jb     e0


void __attribute__ ((noinline)) fill1_nt_si32(vector& v) {
    for (auto& elem : v) {
       _mm_stream_si32(&elem, 1);
    }
}
┌─→movnti %ecx,(%rax)
│  add    $0x4,%rax
│  cmp    %rdx,%rax
└──jne    18


void __attribute__ ((noinline)) fill1_nt_si128(vector& v) {
    assert((long)v.data() % 32 == 0); // alignment
    const __m128i buf = _mm_set1_epi32(1);
    size_t i;
    int* data;
    int* end4 = &v[v.size() - (v.size() % 4)];
    int* end = &v[v.size()];
    for (data = v.data(); data < end4; data += 4) {
        _mm_stream_si128((__m128i*)data, buf);
    }
    for (; data < end; data++) {
        *data = 1;
    }
}
┌─→vmovnt %xmm0,(%rdx)
│  add    $0x10,%rdx
│  cmp    %rcx,%rdx
└──jb     40


void __attribute__ ((noinline)) fill1_nt_si256(vector& v) {
    assert((long)v.data() % 32 == 0); // alignment
    const __m256i buf = _mm256_set1_epi32(1);
    size_t i;
    int* data;
    int* end8 = &v[v.size() - (v.size() % 8)];
    int* end = &v[v.size()];
    for (data = v.data(); data < end8; data += 8) {
        _mm256_stream_si256((__m256i*)data, buf);
    }
    for (; data < end; data++) {
        *data = 1;
    }
}
┌─→vmovnt %ymm0,(%rdx)
│  add    $0x20,%rdx
│  cmp    %rcx,%rdx
└──jb     40

Note: I had to do manual pointer calculation in order to get the loops so compact. Otherwise it would do vector indexing within the loop, probably due to the intrinsic confusing the optimizer.

Underwear answered 2/3, 2017 at 15:4 Comment(11)
rep stos is microcoded in most CPUs (find "REP STOS" and its "Fused µOps" column in the Haswell tables of agner.org/optimize/instruction_tables.pdf, around page 189). Also check CPUID EAX=7, EBX, bit 9 "erms Enhanced REP MOVSB/STOSB" (grep erms /proc/cpuinfo), which is the flag for additionally optimized microcode for rep stos since Nehalem: intel.com/content/dam/www/public/us/en/documents/manuals/… "2.5.6 REP String Enhancement" & 3.7.6 ERMSB. You should compare PMU counters to get some information about the implementation.Kenosis
Also, check stackoverflow.com/a/26256216 for different optimized memcpy/set (and limits of the CPU) and try to ask specific questions on software.intel.com/en-us/forums to get some attention from software.intel.com/en-us/user/545611. The actual Haswell microcode may have some problems with the coherency protocol in the NUMA case, when some of the memory is allocated on a different NUMA node (socket), or memory just happens to be allocated on the other node, so the multi-socket coherency protocol is active when cache lines are allocated. Also check Haswell's errata about its microcode.Kenosis
Sometimes there are authors of the rep s* microcode in the Intel forums: software.intel.com/en-us/forums/… "Seth Abraham (Intel) Fri, 08/04/2006": "It is still possible to write code that is faster still, but the performance gap is not as large, and it is a little harder than it used to be to beat REP MOVSD/STOSD... You can still beat REP MOVSD/STOSD with such code". It could be interesting to rewrite your fill(1) case by hand with rep stosd and compare the speed with rep mov. Also: where does your vector allocate its memory - using mmap?Kenosis
Smaller sizes of vector v may be allocated on the stack (up to and including 131072 bytes), which is wrong for NUMA; bigger vectors are probably allocated by mmap, which is the only correct way for NUMA. When a memory page is accessed for writing the first time, it will be allocated on one NUMA node or another. Always write to memory from the same NUMA node where you will work with it. For the stack, your memory placement may be left over from a previous iteration of bench, which is incorrect for another size of bench. The same can be true for some malloc sizes, when glibc does not return memory back to the OS.Kenosis
@Kenosis thank you for the excellent input. Yes, the CPU supports erms. I'm quite sure that std::vector allocates through malloc / mmap. Since the vector is only declared thread-private, and the thread is pinned, it will be both allocated and first-touched by the NUMA node that eventually uses it. I would strongly hope that the (thread) stack is also allocated on the NUMA node that runs the thread.Underwear
Welcome to the NUMA world. A vector is allocated with malloc and used correctly with first-touch placement, but its deallocation with free will just mark the memory as unused, without returning it to the OS - there will be no next touch for the next iteration (some outdated info on malloc in stackoverflow.com/questions/2215259 and some in stackoverflow.com/a/42281428 "Since 2007 (glibc 2.9 and newer)"). With glibc, call malloc_trim() between benches and the freed memory will be marked as free to the OS and retouched for NUMA. The stack is allocated by the main thread...Kenosis
@Kenosis Adding malloc_trim() after each bench did not result in any significant changes in performance. I don't see any effects in my results that indicate trouble with NUMA. Even if that were the case, fill(0) and fill(1) would be affected the same way! Consider the single socket results in this chart (up to 12 threads).Underwear
Microcoded fill(0) is still slower than the manual loop of fill(1). There is still NUMA cache coherency (even when NUMA memory placement is correct), which can make the microcoded variant slower; can you rerun the code on a non-NUMA machine (with the second socket disabled or not present)?Kenosis
Would numactl --membind=0 --cpunodebind=0 suffice? Can't really disable a socket on these systems.Underwear
Zulan, no, software will not disable cache coherency between sockets (the second socket would have to not be booted / QPI disabled). Your E5-2680 v3 is a 12-core Haswell on the MCC (Medium Core Count) die (anandtech.com/show/8679/…) and there are cache snooping messages on access: frankdenneman.nl/2016/07/11/…. They are sent both on the ring of the local socket and over QPI to the next socket. Some Xeon versions may use a "directory" to limit snooping message storms in memory-bound tasks like this one.Kenosis
You can also check Intel MLC (software.intel.com/en-us/articles/intelr-memory-latency-checker) for measuring the maximum bandwidth of the tested system with mlc --bandwidth_matrix and mlc --peak_bandwidth. Also - a paper about your Haswell and its cache coherency: tu-dresden.de/zih/forschung/ressourcen/dateien/…Kenosis
