How can I accurately benchmark unaligned access speed on x86_64?

In an answer, I stated that unaligned access has had almost the same speed as aligned access for a long time now (on x86/x86_64). I didn't have any numbers to back up this statement, so I've created a benchmark for it.

Do you see any flaws in this benchmark? Can you improve on it (I mean, increase the GB/sec, so it reflects reality better)?

#include <sys/time.h>
#include <stdio.h>

template <int N>
__attribute__((noinline))
void loop32(const char *v) {
    for (int i=0; i<N; i+=160) {
        __asm__ ("mov     (%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x04(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x08(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x0c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x10(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x14(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x18(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x1c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x20(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x24(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x28(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x2c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x30(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x34(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x38(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x3c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x40(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x44(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x48(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x4c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x50(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x54(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x58(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x5c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x60(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x64(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x68(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x6c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x70(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x74(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x78(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x7c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x80(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x84(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x88(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x8c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x90(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x94(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x98(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x9c(%0), %%eax" : : "r"(v) :"eax");
        v += 160;
    }
}

template <int N>
__attribute__((noinline))
void loop64(const char *v) {
    for (int i=0; i<N; i+=160) {
        __asm__ ("mov     (%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x08(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x10(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x18(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x20(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x28(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x30(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x38(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x40(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x48(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x50(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x58(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x60(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x68(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x70(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x78(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x80(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x88(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x90(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x98(%0), %%rax" : : "r"(v) :"rax");
        v += 160;
    }
}

template <int N>
__attribute__((noinline))
void loop128a(const char *v) {
    for (int i=0; i<N; i+=160) {
        __asm__ ("movaps     (%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x10(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x20(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x30(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x40(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x50(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x60(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x70(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x80(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x90(%0), %%xmm0" : : "r"(v) :"xmm0");
        v += 160;
    }
}

template <int N>
__attribute__((noinline))
void loop128u(const char *v) {
    for (int i=0; i<N; i+=160) {
        __asm__ ("movups     (%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x10(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x20(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x30(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x40(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x50(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x60(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x70(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x80(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x90(%0), %%xmm0" : : "r"(v) :"xmm0");
        v += 160;
    }
}

long long int t() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (long long int)tv.tv_sec*1000000 + tv.tv_usec;
}

int main() {
    const int ITER = 10;
    const int N = 1600000000;

    char *data = reinterpret_cast<char *>(((reinterpret_cast<unsigned long long>(new char[N+32])+15)&~15));
    for (int i=0; i<N+16; i++) data[i] = 0;

    {
        long long int t0 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop32<N/100000>(data);
        }
        long long int t1 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop32<N/100000>(data+1);
        }
        long long int t2 = t();
        for (int i=0; i<ITER; i++) {
            loop32<N>(data);
        }
        long long int t3 = t();
        for (int i=0; i<ITER; i++) {
            loop32<N>(data+1);
        }
        long long int t4 = t();

        printf(" 32-bit, cache: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t1-t0)/1000, (double)N*ITER/(t2-t1)/1000, 100.0*(t2-t1)/(t1-t0)-100.0f);
        printf(" 32-bit,   mem: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t3-t2)/1000, (double)N*ITER/(t4-t3)/1000, 100.0*(t4-t3)/(t3-t2)-100.0f);
    }

    {
        long long int t0 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop64<N/100000>(data);
        }
        long long int t1 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop64<N/100000>(data+1);
        }
        long long int t2 = t();
        for (int i=0; i<ITER; i++) {
            loop64<N>(data);
        }
        long long int t3 = t();
        for (int i=0; i<ITER; i++) {
            loop64<N>(data+1);
        }
        long long int t4 = t();

        printf(" 64-bit, cache: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t1-t0)/1000, (double)N*ITER/(t2-t1)/1000, 100.0*(t2-t1)/(t1-t0)-100.0f);
        printf(" 64-bit,   mem: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t3-t2)/1000, (double)N*ITER/(t4-t3)/1000, 100.0*(t4-t3)/(t3-t2)-100.0f);
    }

    {
        long long int t0 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop128a<N/100000>(data);
        }
        long long int t1 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop128u<N/100000>(data+1);
        }
        long long int t2 = t();
        for (int i=0; i<ITER; i++) {
            loop128a<N>(data);
        }
        long long int t3 = t();
        for (int i=0; i<ITER; i++) {
            loop128u<N>(data+1);
        }
        long long int t4 = t();

        printf("128-bit, cache: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t1-t0)/1000, (double)N*ITER/(t2-t1)/1000, 100.0*(t2-t1)/(t1-t0)-100.0f);
        printf("128-bit,   mem: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t3-t2)/1000, (double)N*ITER/(t4-t3)/1000, 100.0*(t4-t3)/(t3-t2)-100.0f);
    }
}
Holmium answered 16/7, 2017 at 12:44 Comment(13)
This question is probably better asked at SE Code Review.Chong
What happened when you ran it?Vega
@user0042: maybe; I can't find any description of which questions should go to Code Review. But this is not a typical code review question, I think.Holmium
@JohnZwinck: what do you mean by that? It printed benchmark results for my PC.Holmium
@Holmium If you have working code, it's good for SE Code Review.Chong
@user0042: yeah, but this time the question is not about code quality, or algorithms, or things like that. It is very low level stuff. But thanks, if it gets closed, I'll move it there.Holmium
@old_timer: I'm mostly interested in unaligned access on average. That's what this benchmark does (in theory...). But if, as you say, unaligned access is much slower in certain circumstances, I'm interested in that too. When does that happen? I'm interested in the "little slower" part too. Could you kindly give some more information about these cases?Holmium
@old_timer: thanks for this! I'll share my benchmark results soon, I'm on phone now.Holmium
@harold Buncha people thought it was off-topic when it was first posted, and the downvotes piled on. Took a while for the optimization experts to wake up on a lazy Sunday morning and see it, I guess. :-)Vivacious
@harold: As soon as I posted a useful answer, people realized it wasn't that bad a question. It kinda looks like it belongs on codereview, but it's really asking about what did I miss in testing memory performance. It went from -3 to 0 in a couple minutes :PSwatter
@old_timer: (certainly on x86 with an operating system there is enough overhead to skew the results). This is very rarely a problem these days. CPUs are so fast now that a timer interrupt only happens once in 4 million cycles. With multiple cores, any crap your desktop is running in the background can usually use another core, if you did a decent job of making your desktop as idle as possible. Also, Linux virtualizes performance counters so you can use perf to just get counts for user-space cycles, with very good precision. Definitely enough to detect any unaligned penalty.Swatter
For what it's worth, uarch-bench has a test specifically testing the throughput of loads and stores in L1D for all alignments within a 64-byte line. It only runs on Linux currently (but a Windows port should be easy) and it generally gets results accurate to 1% or better. There is definitely still a penalty for some misaligned loads on every measured architecture, although for recent Intel it is only loads that cross a 64-byte boundary. Some more results and discussion here.Rufus
@Holmium - FWIW for a question like this, I wouldn't tag it C or C++. Sure you are using C++ as a wrapper for your inline assembly, but it's not really relevant: it's mostly an x86 and performance question. I find that anything which can't be answered by parsing the standard and which has a C or C++ tag often tends to get downvoted to oblivion. So if you are asking about performance, something non-standard, something which isn't recommended, I often try to find more specific tags and avoid those two.Rufus

Timing method. I probably would have set it up so the test was selected by a command-line argument, so I could time it with perf stat ./unaligned-test, and get perf counter results instead of just wall-clock times for each test. That way, I wouldn't have to care about turbo / power-saving, since I could measure in core clock cycles. (Not the same thing as gettimeofday / rdtsc reference cycles unless you disable turbo and other frequency-variation. Even then, the CPU frequency doesn't always match the TSC, but at fixed CPU frequency, wall-clock-equivalent timers such as rdtsc are usable.)
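
For example, something like this minimal sketch (my addition, not code from the question; the loop body, buffer size, and repeat count are just placeholders). Each invocation runs exactly one test, selected by argv[1]:

#include <cstdio>
#include <cstring>

// Placeholder load loop (hypothetical): 8-byte loads, results discarded.
// The real benchmark would put its aligned / unaligned / split loops here.
__attribute__((noinline))
static void read_loop(const char *v, long bytes) {
    for (long i = 0; i < bytes; i += 8)
        __asm__ volatile("mov %0, %%rax" : : "m"(v[i]) : "rax");
}

alignas(64) static char buf[1 << 20];

int main(int argc, char **argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s aligned|unaligned\n", argv[0]); return 1; }
    long off = (std::strcmp(argv[1], "unaligned") == 0) ? 1 : 0;
    for (int rep = 0; rep < 100000; rep++)        // enough repeats that startup cost is negligible
        read_loop(buf + off, sizeof(buf) - 64);
}

Then perf stat -e task-clock,cycles,instructions ./unaligned-test unaligned counts core clock cycles for just that one variant, with only a tiny, near-constant amount of startup overhead mixed in.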


You're only testing throughput, not latency, because none of the loads are dependent.
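
For a latency test you need a dependent chain, e.g. pointer-chasing where each load's address comes from the previous load. A minimal sketch of that (my addition; harold's answer below does the same thing in asm, with a self-referential pointer):

// Dependent-load chain: the next address is the previous load's result, so the
// loop runs at one load-use latency per iteration instead of at load throughput.
__attribute__((noinline))
static void chase(void **p, long iters) {
    for (long i = 0; i < iters; i++)
        __asm__ volatile("mov (%0), %0" : "+r"(p) : : "memory");
}

int main() {
    static void *cell = &cell;       // self-referential pointer: loading it yields its own address
    chase(&cell, 100000000);         // time this loop; ~4-5 core cycles per iteration on recent Intel
}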

The misalignment penalty in your cache numbers will look worse than in your memory numbers, but you might not realize why: in the cached case you may be bottlenecking on the number of split-load registers that handle loads/stores that cross a cache-line boundary. For sequential reads, the outer levels of cache are still always just going to see a sequence of requests for whole cache lines. It's only the execution units getting data from L1D that have to care about alignment. To test misalignment for the non-cached case, you could do scattered loads, so cache-line splits would need to bring two cache lines into L1.
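
Something along these lines (my addition, a rough sketch with an arbitrary stride and buffer size) makes each split load pull in a second line that the rest of the pass wasn't about to stream in anyway:

#include <cstdlib>
#include <cstring>

// Strided split loads over a buffer much larger than L3: every load starts 2 bytes
// before a 64B boundary, and consecutive loads are many lines apart, so the extra
// line a split needs is usually not one a sequential stream was already bringing in.
__attribute__((noinline))
static void scattered_split_loads(const char *base, long n_lines) {
    const long stride = 64 * 37;                   // odd line stride: visits each line once per pass
    long off = 0;
    for (long i = 0; i < n_lines; i++) {
        __asm__ volatile("mov %0, %%rax" : : "m"(base[off + 62]) : "rax");   // 8B line-split load
        off += stride;
        if (off >= n_lines * 64) off -= n_lines * 64;
    }
}

int main() {
    const long n_lines = 1L << 22;                 // 4M lines = 256 MiB
    char *buf = static_cast<char *>(std::aligned_alloc(4096, n_lines * 64 + 4096));
    std::memset(buf, 0, n_lines * 64 + 4096);      // fault all the pages in before timing
    scattered_split_loads(buf, n_lines);           // compare against base[off + 0] as the aligned control
    std::free(buf);
}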

Cache lines are 64 bytes wide (footnote 1), so you're always testing a mix of cache-line splits and within-a-cache-line accesses. Testing always-split loads would bottleneck harder on the split-load microarchitectural resources. (Actually, depending on your CPU, the cache-fetch width might be narrower than the line size. Recent Intel CPUs can fetch any unaligned chunk from inside a cache line, but that's because they have special hardware to make that fast. Other CPUs may only be at their fastest when fetching within a naturally-aligned 16 byte chunk or something. @BeeOnRope says that AMD CPUs may care about 16 byte and 32 byte boundaries. See also https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#memory-related-limits)

You're not testing store → load forwarding at all. For existing tests, and a nice way to visualize results for different alignments, see this stuffedcow.net blog post: Store-to-Load Forwarding and Memory Disambiguation in x86 Processors.

Passing data through memory is an important use case, and misalignment + cache-line splits can interfere with store-forwarding on some CPUs. To properly test this, make sure you test different misalignments, not just 1:15 (vector) or 1:3 (integer). (You currently only test a +1 offset relative to 16B-alignment).
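
A sketch of what I mean (my addition, untested): store 8 bytes and immediately reload them, sweeping every offset within a line, so store-forwarding gets exercised at each alignment (including the line-crossing offsets 57..63):

#include <cstdint>

alignas(64) static char buf[128];                  // room for an 8-byte access at any offset 0..63

// Store 8 bytes at the given offset, then reload them: a dependent store/reload pair
// that goes through store-to-load forwarding at exactly that alignment. The value in
// rax doesn't matter; the cast only tells the compiler the operand is 8 bytes wide.
__attribute__((noinline))
static void store_reload(char *p, long iters) {
    for (long i = 0; i < iters; i++)
        __asm__ volatile("mov %%rax, %0 \n\t"
                         "mov %0, %%rax"
                         : "+m"(*reinterpret_cast<std::uint64_t *>(p)) : : "rax");
}

int main() {
    for (int off = 0; off < 64; off++)             // every misalignment relative to the 64B line
        store_reload(buf + off, 100000000);        // time each offset separately (rdtsc, or one offset per run under perf)
}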

I forget if it's just for store-forwarding, or for regular loads, but there may be less penalty when a load is split evenly across a cache-line boundary (an 8:8 vector, and maybe also 4:4 or 2:2 integer splits). You should test this. (I might be thinking of P4 lddqu or Core 2 movdqu)

Intel's optimization manual has big tables of misalignment vs. store-forwarding from a wide store to narrow reloads that are fully contained in it. On some CPUs, this works in more cases when the wide store was naturally-aligned, even if it doesn't cross any cache-line boundaries. (Maybe on SnB/IvB, since they use a banked L1 cache with 16B banks, and splits across those can affect store forwarding.

I didn't re-check the manual, but if you really want to test this experimentally, that's something you should be looking for.)


Which reminds me, misaligned loads are more likely to provoke cache-bank conflicts on SnB/IvB (because one load can touch two banks). But you won't see this loading from a single stream, because accessing the same bank in the same line twice in one cycle is fine. It's only accessing the same bank in different lines that can't happen in the same cycle. (e.g., when two memory accesses are a multiple of 128 bytes apart.)

You don't make any attempt to test 4k page-splits. They are slower than regular cache-line splits, because they also need two TLB checks. (Skylake improved them from a ~100-cycle penalty to a ~5-cycle penalty beyond the normal load-use latency, though.)
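
Adding that case is easy: make every load straddle a page boundary. A minimal sketch (my addition):

#include <cstring>

alignas(4096) static char pages[2 * 4096];         // two adjacent 4k pages

// Every load reads the last 4 bytes of the first page plus the first 4 bytes of the
// second, so each one is both a cache-line split and a 4k page split (two TLB lookups).
__attribute__((noinline))
static void page_split_loads(long iters) {
    const char *p = pages + 4096 - 4;
    for (long i = 0; i < iters; i++)
        __asm__ volatile("mov %0, %%rax" : : "m"(*p) : "rax");
}

int main() {
    std::memset(pages, 1, sizeof(pages));          // touch both pages so both TLB entries are hot
    page_split_loads(100000000);                   // compare against the same loop at pages + 0
}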

You fail to test movups on aligned addresses, so you wouldn't detect that movups is slower than movaps on Core 2 and earlier even when the memory is aligned at runtime. (I think unaligned mov loads up to 8 bytes were fine even in Core 2, as long as they didn't cross a cache-line boundary. IDK how old a CPU you'd have to look at to find a problem with non-vector loads within a cache line. It would be a 32-bit only CPU, but you could still test 8 byte loads with MMX or SSE, or even x87. P5 Pentium and later guarantee that aligned 8 byte loads/stores are atomic, but P6 and newer guarantee that cached 8 byte loads/stores are atomic as long as no cache-line boundary is crossed. Unlike AMD, where 8 byte boundaries matter for atomicity guarantees even in cacheable memory. Why is integer assignment on a naturally aligned variable atomic on x86?)

Go look at Agner Fog's stuff to learn more about how unaligned loads can be slower, and cook up tests to exercise those cases. Actually, Agner may not be the best resource for that, since his microarchitecture guide mostly focuses on getting uops through the pipeline. Just a brief mention of the cost of cache-line splits, nothing in-depth about throughput vs. latency.

See also: Cacheline splits, take two, from Dark Shikari's blog (x264 lead developer), talking about unaligned load strategies on Core2: it was worth it to check for alignment and use a different strategy for the block.


Footnote 1: 64B cache lines are a safe assumption these days. Pentium 3 and earlier had 32B lines. P4 had 64B lines, but they were often transferred in 128B-aligned pairs. I thought I remembered reading that P4 actually had 128B lines in L2 or L3, but maybe that was just a distortion of 64B lines transferred in pairs. 7-CPU definitely says 64B lines in both levels of cache for a P4 130nm.

Modern Intel CPUs have an adjacent-line L2 "spatial" prefetcher that similarly tends to pull in the other half of a 128-byte aligned pair, which can increase false sharing in some cases. Should the cache padding size of x86-64 be 128 bytes? shows an experiment that demonstrates this.


See also uarch-bench results for Skylake. Apparently someone has already written a tester that checks every possible misalignment relative to a cache-line boundary.


My testing on Skylake desktop (i7-6700k)

Addressing mode affects load-use latency, exactly as Intel documents in their optimization manual. I tested with integer mov rax, [rax+...], and with movzx/sx (in that case using the loaded value as an index, since it's too narrow to be a pointer).

;;;  Linux x86-64 NASM/YASM source.  Assemble into a static binary
;; public domain, originally written by [email protected].
;; Share and enjoy.  If it breaks, you get to keep both pieces.

;;; This kind of grew while I was testing and thinking of things to test
;;; I left in some of the comments, but took out most of them and summarized the results outside this code block
;;; When I thought of something new to test, I'd edit, save, and up-arrow my assemble-and-run shell command
;;; Then edit the result into a comment in the source.

section .bss

ALIGN   2 * 1<<20   ; 2MB = 4096*512.  Uses hugepages in .bss but not in .data.  I checked in /proc/<pid>/smaps
buf:    resb 16 * 1<<20

section .text
global _start
_start:
    mov     esi, 128

;   mov             edx, 64*123 + 8
;   mov             edx, 64*123 + 0
;   mov             edx, 64*64 + 0
    xor             edx,edx
   ;; RAX points into buf, 16B into the last 4k page of a 2M hugepage

    mov             eax, buf + (2<<20)*0 + 4096*511 + 64*0 + 16
    mov             ecx, 25000000

%define ADDR(x)  x                     ; SKL: 4c
;%define ADDR(x)  x + rdx              ; SKL: 5c
;%define ADDR(x)  128+60 + x + rdx*2   ; SKL: 11c cache-line split
;%define ADDR(x)  x-8                 ; SKL: 5c
;%define ADDR(x)  x-7                 ; SKL: 12c for 4k-split (even if it's in the middle of a hugepage)
; ... many more things and a block of other result-recording comments taken out

%define dst rax



        mov             [ADDR(rax)], dst
align 32
.loop:
        mov             dst, [ADDR(rax)]
        mov             dst, [ADDR(rax)]
        mov             dst, [ADDR(rax)]
        mov             dst, [ADDR(rax)]
    dec         ecx
    jnz .loop

        xor edi,edi
        mov eax,231
    syscall

Then run with

asm-link load-use-latency.asm && disas load-use-latency && 
    perf stat -etask-clock,cycles,L1-dcache-loads,instructions,branches -r4 ./load-use-latency

+ yasm -felf64 -Worphan-labels -gdwarf2 load-use-latency.asm
+ ld -o load-use-latency load-use-latency.o
 (disassembly output so my terminal history has the asm with the perf results)

 Performance counter stats for './load-use-latency' (4 runs):

     91.422838      task-clock:u (msec)       #    0.990 CPUs utilized            ( +-  0.09% )
   400,105,802      cycles:u                  #    4.376 GHz                      ( +-  0.00% )
   100,000,013      L1-dcache-loads:u         # 1093.819 M/sec                    ( +-  0.00% )
   150,000,039      instructions:u            #    0.37  insn per cycle           ( +-  0.00% )
    25,000,031      branches:u                #  273.455 M/sec                    ( +-  0.00% )

   0.092365514 seconds time elapsed                                          ( +-  0.52% )

In this case, I was testing mov rax, [rax], naturally-aligned, so cycles = 4*L1-dcache-loads. 4c latency. I didn't disable turbo or anything like that. Since nothing is going off the core, core clock cycles is the best way to measure.

  • [base + 0..2047]: 4c load-use latency, 11c cache-line split, 11c 4k-page split (even when inside the same hugepage). See Is there a penalty when base+offset is in a different page than the base? for more details: if base+disp turns out to be in a different page than base, the load uop has to be replayed.
  • any other addressing mode: 5c latency, 11c cache-line split, 12c 4k-split (even inside a hugepage). This includes [rax - 16]. It's not disp8 vs. disp32 that makes the difference.

So: hugepages don't help avoid page-split penalties (at least not when both pages are hot in the TLB). A cache-line split makes addressing mode irrelevant, but "fast" addressing modes have 1c lower latency for normal and page-split loads.

4k-split handling is fantastically better than it used to be; see @harold's numbers, where Haswell has ~32c latency for a 4k-split. (And older CPUs may be even worse than that. I thought pre-SKL it was supposed to be a ~100 cycle penalty.)

Throughput (regardless of addressing mode), measured by using a destination other than rax so the loads are independent:

  • no split: 0.5c.
  • CL-split: 1c.
  • 4k-split: ~3.8 to 3.9c (much better than pre-Skylake CPUs)

Same throughput/latency for movzx/movsx (including WORD splits), as expected because they're handled in the load port (unlike some AMD CPUs, where there's also an ALU uop).

Uops dependent on cache-line split loads get replayed from the RS (Reservation Station). Counters for uops_dispatched_port.port_2 + port_3 = 2x number of mov rdi, [rdi], in another test using basically the same loop. (This was a dependent-load case, not throughput limited.) The CPU can't detect a split load until after AGU produces a linear address.

I previously thought split loads themselves got replayed, but that was based on this pointer-chasing test where every load is dependent on a previous load. If we put an imul rdi, rdi, 1 in the loop, we'd get extra port 1 ALU counts for it getting replayed, not the loads.

A split load only has to dispatch once, but I'm not sure if it later borrows a cycle in the same load port to access the other cache line (and combine it with the first part saved in a split register inside that load port.) Or to initiate a demand-load for the other line if it's not present in L1d.

Whatever the details, throughput of cache-line-split loads is lower than non-splits even if you avoid replays of loads. (We didn't test pointer chasing with that anyway.)

See also Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up? for more about uop replays. (But note that's for uops dependent on a load, not the load uop itself. In that Q&A, the dependent uops are also mostly loads.)

A cache-miss load doesn't itself need to be replayed to "accept" the incoming data when it's ready, only dependent uops. See chat discussion on Are load ops deallocated from the RS when they dispatch, complete or some other time?. This https://godbolt.org/z/HJF3BN NASM test case on i7-6700k shows the same number of load uops dispatched regardless of L1d hits or L3 hits. But the number of ALU uops dispatched (not counting loop overhead) goes from 1 per load to ~8.75 per load. The scheduler aggressively schedules uops consuming the data to dispatch in the cycle when load data might arrive from L2 cache (and then very aggressively after that, it seems), instead of waiting one extra cycle to see if it did or not.

We haven't tested how aggressive replay is when there's other independent but younger work that could be done on the same port whose inputs are definitely ready.


SKL has two hardware page-walk units, which is probably related to the massive improvement in 4k-split performance. Even when there are no TLB misses, presumably older CPUs had to account for the fact that there might be.

It's interesting that the 4k-split throughput is non-integer. I think my measurements had enough precision and repeatability to say this. Remember this is with every load being a 4k-split, and no other work going on (except for being inside a small dec/jnz loop). If you ever have this in real code, you're doing something really wrong.

I don't have any solid guesses at why it might be non-integer, but clearly there's a lot that has to happen microarchitecturally for a 4k-split. It's still a cache-line split, and it has to check the TLB twice.

Swatter answered 16/7, 2017 at 14:39 Comment(15)
The comments that brought this on prompted me to look at Agner Fog's manuals briefly, and I didn't see much of anything about aligned vs. unaligned loads except in the context of SIMD. Although geza does include some SIMD code in the test case here, I think he's mainly thinking of memory-to-integer-register loads (or at least, that was the original motivation). Anyway, on an unrelated note, you say here that cache lines are 64 bytes wide. Is that always true for x86? I was wondering about that when doing the P4 tests the other day.Vivacious
Agner Fog does at least define throughput and latency well in the instruction tables introduction. That's one of the clearer definitions I've seen. Although if I remember correctly, someone asked a question about the difference one time, and you posted a pretty good answer there, too.Vivacious
@CodyGray: Oh yes, my bad. IIRC, P3 has 32B lines. P4 has 64B lines, but maybe L2 cache is 128B, I forget. Everything since then has 64B lines all the way down.Swatter
@CodyGray: Are you thinking of Intel Intrinsics guide - Latency and Throughput? It's convenient that my last name is uncommon, so searching on it with some computery keyword often brings up what I was looking for when I remember having written some kind of answer about a subject, and want to link it. :)Swatter
You can search your answers from your SO profile page, too. When you navigate there, it automatically fills it in with the user ID. Works for any user. I use that pretty often. Of course, the SO search still has nothing on Google, so site:stackoverflow.com plus a user name plus a keyword is still often my go-to tool. "latency" and "throughput" are hard keywords to search for, though. They crop up too often!Vivacious
@CodyGray: Good point about focusing on integer loads. I thought I had been pretty general about stuff that applied to both, but I went back and looked at the even-split part and I was only thinking vectors there. I ended up adding a bunch of new stuff. :PSwatter
@CodyGray Keep in mind that the cache line size isn't necessarily the only interesting boundary: for loads and stores you often have other smaller "cache access size" boundaries (although on recent Intel that also seems to be 64-bytes). On AMD, for example, the 16B and 32B boundaries matter. You can find a brief discussion here.Rufus
@PeterCordes - here are the Ryzen results which show the dependence on 16B and 32B boundaries (reported here). A summary of the behavior starts at "What I see for Ryzen:" in this post.Rufus
BTW, outside of not being updated for newer CPUs I think this blog post still has the best way to visualize latencies for loads and stores. Strictly speaking it is trying to investigate store-to-load latency, but the entries off the main diagonal don't overlap so there it becomes a throughput test of stores and loads (and you can clearly see that even going back many generations Intel only really suffers at the edge of a 64B boundary). It shows clearly how AMD has various interesting effects around 16B boundaries also.Rufus
@BeeOnRope: I added a public domain notice for the code, in case the usual SO CC-by-SA is a problem for anyone. It's too trivial to bother GPLing or anything.Swatter
Thanks @Peter, I may use bits of it in uarch-bench (I'm the author).Rufus
@BeeOnRope: There was another set of Ryzen results in that thread where dependent add was showing 0.56 cycles or something. I assume that was a measurement / turbo ramp-up artifact, right? And cheers, glad my code was useful.Swatter
Yes, it was run on earlier version that didn't have ramp-up code at the start, and so the CPU was running at just above half speed during the calibration loop, but the results are still useful (just mentally map 0.56 to 1 cycle) as explained in this post. Later on that user re-ran the test with warmup and got the expected results. @PeterCordesRufus
@PeterCordes: I'd expect that for a "4K split" (where the CPU has to be able to tolerate different caching - e.g. half on a "write-back" page and half on an "uncached" page), the CPU behaves as if it's 2 completely separate writes (with double the costs). Also (due to high-level paging-structure caches) the pathological case would be a "512 G split"; possibly with wrapping (e.g. writing 4 bytes such that 2 bytes go to virtual address 0xFFFFFFFFFFFFFFFE and the other 2 bytes go to 0x0000000000000000), which is so deviously nasty that I wouldn't exclude the possibility of hitting CPU errata on some CPU/s.Wooley
Another factor, for SnB at the very least: if there are any stores in flight, any unaligned load that crosses a 16-byte boundary or any unaligned 32-byte load will end up serializing with the store regardless of conflict (there will be no memory-disambiguation prediction). If it's related to the 16-byte banks then it probably isn't an issue as of Haswell. I've tested that it isn't an issue on Tiger Lake.Selfseeking

Testing 64-bit loads for various offsets (code below), my raw results on Haswell are:

aligned L: 4.01115 T: 0.500003
ofs1 L: 4.00919 T: 0.500003
ofs2 L: 4.01494 T: 0.500003
ofs3 L: 4.01403 T: 0.500003
ofs7 L: 4.01073 T: 0.500003
ofs15 L: 4.01937 T: 0.500003
ofs31 L: 4.02107 T: 0.500002
ofs60 L: 9.01482 T: 1
ofs62 L: 9.03644 T: 1
ofs4092 L: 32.3014 T: 31.1967

Apply rounding as you see fit. Most of them should obviously be rounded down, but .3 and .2 (from the page boundary crossing) are perhaps too significant to be noise. This only tested loads with simple addresses, and only "pure loads", no forwarding.

I conclude that alignment within a cache line is not relevant for scalar loads, only crossing cache line boundaries and (especially, and for obvious reasons) crossing page boundaries matters. There seems to be no difference between crossing a cache line boundary exactly in the middle or somewhere else in this case.

AMD occasionally has some funny effects with 16-byte boundaries, but I cannot test that.

And here are raw(!) xmm vector results which include the effects of pextrq, so subtract two cycles of latency:

aligned L: 8.05247 T: 0.500003
ofs1 L: 8.03223 T: 0.500003
ofs2 L: 8.02899 T: 0.500003
ofs3 L: 8.05598 T: 0.500003
ofs7 L: 8.03579 T: 0.500002
ofs15 L: 8.02787 T: 0.500003
ofs31 L: 8.05002 T: 0.500003
ofs58 L: 13.0404 T: 1
ofs60 L: 13.0825 T: 1
ofs62 L: 13.0935 T: 1
ofs4092 L: 36.345 T: 31.2357

The testing code was

global test_unaligned_l
proc_frame test_unaligned_l
    alloc_stack 8
[endprolog]
    mov r9, rcx
    rdtscp
    mov r8d, eax

    mov ecx, -10000000
    mov rdx, r9
.loop:
    mov rdx, [rdx]
    mov rdx, [rdx]
    add ecx, 1
    jnc .loop

    rdtscp
    sub eax, r8d

    add rsp, 8
    ret
endproc_frame

global test_unaligned_tp
proc_frame test_unaligned_tp
    alloc_stack 8
[endprolog]
    mov r9, rcx
    rdtscp
    mov r8d, eax

    mov ecx, -10000000
    mov rdx, r9
.loop:
    mov rax, [rdx]
    mov rax, [rdx]
    add ecx, 1
    jnc .loop

    rdtscp
    sub eax, r8d

    add rsp, 8
    ret
endproc_frame

For vectors largely similar but with pextrq in the latency test.

With some data prepared at various offsets, for example:

align 64
%rep 31
db 0
%endrep
unaligned31: dq unaligned31
align 4096
%rep 60
db 0
%endrep
unaligned60: dq unaligned60
align 4096
%rep 4092
db 0
%endrep
unaligned4092: dq unaligned4092

To focus a bit more on the new title, I'll describe what this is trying to do and why.

First off, there is a latency test. Loading a million things into eax from some pointer that isn't in eax (as the code in the question does) tests throughput, which is only half of the picture. For scalar loads that is trivial; for vector loads I used pairs of:

movdqu xmm0, [rdx]
pextrq rdx, xmm0, 0

The latency of pextrq is 2; that's why the latency figures for vector loads are all 2 too high, as noted.

In order to make it easy to do this latency test, the data is a self-referential pointer. That's a fairly atypical scenario, but it shouldn't affect the timing characteristics of the loads.

The throughput test has two loads per loop instead of one to avoid being bottlenecked by the loop overhead. More loads could be used, but that isn't necessary on Haswell (or anything I can think of, but in theory a microarchitecture with a lower branch throughput or a higher load throughput could exist).

I'm not super careful about fencing in the TSC read or compensating for its overhead (or other overhead). I also didn't disable Turbo; I just let it run at turbo frequency and divided by the ratio between the TSC rate and the turbo frequency, which could affect timings a bit. All of these effects are tiny compared to a benchmark on the order of 1E7, and the results can be rounded anyway.

All times were best-of-30; things such as average and variance are pointless on these micro benchmarks, since the ground truth is not a random process with parameters that we want to estimate but some fixed integer (see note 1), or an integer multiple of a fraction for throughput. Almost all noise is positive, except the (relatively theoretical) case of instructions from the benchmark "leaking" in front of the first TSC read (this could even be avoided if necessary), so taking the minimum is appropriate.

Note 1: except when crossing a 4k boundary, apparently; something strange is happening there.

Marchak answered 16/7, 2017 at 15:37 Comment(13)
The even-split thing might just be for store-forwarding, not for loads. Or for loads, maybe it was more efficient on Core2 or something, but not Haswell.Swatter
re: asm style. align directives work in the BSS, so you could have used resb. Or you could have used times 4092 db 0 instead of %rep.Swatter
@PeterCordes this is not in the BSS though, but yes times would doMarchak
I meant you could have used the BSS, even though you want control over alignment :P Oh, I just noticed you are putting self-referring pointers in your data. NVM then.Swatter
@PeterCordes it seemed useful for the latency test, I could also add a zero from BSS to the pointer I guess..Marchak
Oh right. When I tested load-use latency, I used a mov r32, imm32 to put an address into a register, then stored it to itself. (Or I could have used a stack address. I think I was worried about the stack being somehow different.) Anyway, either way makes sense.Swatter
BTW, if you test different addressing modes, you should find that [base + 0..2047] has 4c load-use latency, and anything else has 5c latency. Intel's manual says that, and I've confirmed it on SKL-S. CL-split don't care about addressing mode: 11c load-use latency for cache-line splits. But 4k splits are also 11c total load-use latency for fast addressing modes, and 12c for other addressing modes. (These are all for 64-bit integer mov loads, and for movzx/sx word -> dword or qword loads).Swatter
SKL throughput for 64-bit mov rax, [rax+...]: no split: 0.5c. CL-split: 1c. 4k-split: ~3.8 to 3.9c. Regardless of addressing mode.Swatter
@PeterCordes well put it in your answer, otherwise no one will see it, that would be a waste. Do you have any theory about where the non-integral timings for a page split come from?Marchak
right, good point. I was thinking it belonged in/with your answer since yours is the one that already has experimental results. But I just added a section at the end of mine.Swatter
BTW whoever is doing it, stop serially upvoting my old answers. It's just going to get undone anyway when the system catches on.Marchak
Why pextrq rdx,xmm0,0 instead of movq rdx,xmm0? That would be 1 less uop, and 1c better latency. (And also only requiring SSE2).Swatter
Let us continue this discussion in chat.Marchak

I'm putting my slightly improved benchmark here. It still measures throughput only (and only for unaligned offset 1). Based on the other answers, I've added measurements of 64- and 4096-byte splits.

For 4k splits, there's a huge difference! But if the data doesn't cross a 64-byte boundary, there's no speed loss at all (at least for the two processors I've tested).

Looking at these numbers (and the numbers in the other answers), my conclusion is that unaligned access is fast on average (both throughput and latency), but there are cases when it can be much slower. But that doesn't mean its use should be discouraged.

The raw numbers produced by my benchmark should be taken with a grain of salt (it is highly likely that properly written asm code would outperform it), but these results mostly agree with harold's answer for Haswell (difference column).

Haswell:

Full:
 32-bit, cache: aligned:  33.2901 GB/sec unaligned:  29.5063 GB/sec, difference: 1.128x
 32-bit,   mem: aligned:  12.1597 GB/sec unaligned:  12.0659 GB/sec, difference: 1.008x
 64-bit, cache: aligned:  66.0368 GB/sec unaligned:  52.8914 GB/sec, difference: 1.249x
 64-bit,   mem: aligned:  16.1317 GB/sec unaligned:  16.0568 GB/sec, difference: 1.005x
128-bit, cache: aligned: 129.8730 GB/sec unaligned:  87.9791 GB/sec, difference: 1.476x
128-bit,   mem: aligned:  16.8150 GB/sec unaligned:  16.8151 GB/sec, difference: 1.000x

JustBoundary64:
 32-bit, cache: aligned:  32.5555 GB/sec unaligned:  16.0175 GB/sec, difference: 2.032x
 32-bit,   mem: aligned:   1.0044 GB/sec unaligned:   1.0001 GB/sec, difference: 1.004x
 64-bit, cache: aligned:  65.2707 GB/sec unaligned:  32.0431 GB/sec, difference: 2.037x
 64-bit,   mem: aligned:   2.0093 GB/sec unaligned:   2.0007 GB/sec, difference: 1.004x
128-bit, cache: aligned: 130.6789 GB/sec unaligned:  64.0851 GB/sec, difference: 2.039x
128-bit,   mem: aligned:   4.0180 GB/sec unaligned:   3.9994 GB/sec, difference: 1.005x

WithoutBoundary64:
 32-bit, cache: aligned:  33.2911 GB/sec unaligned:  33.2916 GB/sec, difference: 1.000x
 32-bit,   mem: aligned:  11.6156 GB/sec unaligned:  11.6223 GB/sec, difference: 0.999x
 64-bit, cache: aligned:  65.9117 GB/sec unaligned:  65.9548 GB/sec, difference: 0.999x
 64-bit,   mem: aligned:  14.3200 GB/sec unaligned:  14.3027 GB/sec, difference: 1.001x
128-bit, cache: aligned: 128.2605 GB/sec unaligned: 128.3342 GB/sec, difference: 0.999x
128-bit,   mem: aligned:  12.6352 GB/sec unaligned:  12.6218 GB/sec, difference: 1.001x

JustBoundary4096:
 32-bit, cache: aligned:  33.5500 GB/sec unaligned:   0.5415 GB/sec, difference: 61.953x
 32-bit,   mem: aligned:   0.4527 GB/sec unaligned:   0.0431 GB/sec, difference: 10.515x
 64-bit, cache: aligned:  67.1141 GB/sec unaligned:   1.0836 GB/sec, difference: 61.937x
 64-bit,   mem: aligned:   0.9112 GB/sec unaligned:   0.0861 GB/sec, difference: 10.582x
128-bit, cache: aligned: 134.2000 GB/sec unaligned:   2.1668 GB/sec, difference: 61.936x
128-bit,   mem: aligned:   1.8165 GB/sec unaligned:   0.1700 GB/sec, difference: 10.687x

Sandy Bridge (processor from 2011)

Full:
 32-bit, cache: aligned:  30.0302 GB/sec unaligned:  26.2587 GB/sec, difference: 1.144x
 32-bit,   mem: aligned:  11.0317 GB/sec unaligned:  10.9358 GB/sec, difference: 1.009x
 64-bit, cache: aligned:  59.2220 GB/sec unaligned:  41.5515 GB/sec, difference: 1.425x
 64-bit,   mem: aligned:  14.5985 GB/sec unaligned:  14.3760 GB/sec, difference: 1.015x
128-bit, cache: aligned: 115.7643 GB/sec unaligned:  45.0905 GB/sec, difference: 2.567x
128-bit,   mem: aligned:  14.8561 GB/sec unaligned:  14.8220 GB/sec, difference: 1.002x

JustBoundary64:
 32-bit, cache: aligned:  15.2127 GB/sec unaligned:   3.1037 GB/sec, difference: 4.902x
 32-bit,   mem: aligned:   0.9870 GB/sec unaligned:   0.6110 GB/sec, difference: 1.615x
 64-bit, cache: aligned:  30.2074 GB/sec unaligned:   6.2258 GB/sec, difference: 4.852x
 64-bit,   mem: aligned:   1.9739 GB/sec unaligned:   1.2194 GB/sec, difference: 1.619x
128-bit, cache: aligned:  60.7265 GB/sec unaligned:  12.4007 GB/sec, difference: 4.897x
128-bit,   mem: aligned:   3.9443 GB/sec unaligned:   2.4460 GB/sec, difference: 1.613x

WithoutBoundary64:
 32-bit, cache: aligned:  30.0348 GB/sec unaligned:  29.9801 GB/sec, difference: 1.002x
 32-bit,   mem: aligned:  10.7067 GB/sec unaligned:  10.6755 GB/sec, difference: 1.003x
 64-bit, cache: aligned:  59.1895 GB/sec unaligned:  59.1925 GB/sec, difference: 1.000x
 64-bit,   mem: aligned:  12.9404 GB/sec unaligned:  12.9307 GB/sec, difference: 1.001x
128-bit, cache: aligned: 116.4629 GB/sec unaligned: 116.0778 GB/sec, difference: 1.003x
128-bit,   mem: aligned:  11.2963 GB/sec unaligned:  11.3533 GB/sec, difference: 0.995x

JustBoundary4096:
 32-bit, cache: aligned:  30.2457 GB/sec unaligned:   0.5626 GB/sec, difference: 53.760x
 32-bit,   mem: aligned:   0.4055 GB/sec unaligned:   0.0275 GB/sec, difference: 14.726x
 64-bit, cache: aligned:  60.6175 GB/sec unaligned:   1.1257 GB/sec, difference: 53.851x
 64-bit,   mem: aligned:   0.8150 GB/sec unaligned:   0.0551 GB/sec, difference: 14.798x
128-bit, cache: aligned: 121.2121 GB/sec unaligned:   2.2455 GB/sec, difference: 53.979x
128-bit,   mem: aligned:   1.6255 GB/sec unaligned:   0.1103 GB/sec, difference: 14.744x

Here's the code:

#include <sys/time.h>
#include <stdio.h>

__attribute__((always_inline))
void load32(const char *v) {
    __asm__ ("mov     %0, %%eax" : : "m"(*v) :"eax");
}

__attribute__((always_inline))
void load64(const char *v) {
    __asm__ ("mov     %0, %%rax" : : "m"(*v) :"rax");
}

__attribute__((always_inline))
void load128a(const char *v) {
    __asm__ ("movaps     %0, %%xmm0" : : "m"(*v) :"xmm0");
}

__attribute__((always_inline))
void load128u(const char *v) {
    __asm__ ("movups     %0, %%xmm0" : : "m"(*v) :"xmm0");
}

struct Full {
    template <int S>
    static float factor() {
        return 1.0f;
    }
    template <void (*LOAD)(const char *), int S, int N>
    static void loop(const char *v) {
        for (int i=0; i<N; i+=S*16) {
            LOAD(v+S* 0);
            LOAD(v+S* 1);
            LOAD(v+S* 2);
            LOAD(v+S* 3);
            LOAD(v+S* 4);
            LOAD(v+S* 5);
            LOAD(v+S* 6);
            LOAD(v+S* 7);
            LOAD(v+S* 8);
            LOAD(v+S* 9);
            LOAD(v+S*10);
            LOAD(v+S*11);
            LOAD(v+S*12);
            LOAD(v+S*13);
            LOAD(v+S*14);
            LOAD(v+S*15);
            v += S*16;
        }
    }
};

struct JustBoundary64 {
    template <int S>
    static float factor() {
        return S/64.0f;
    }
    template <void (*LOAD)(const char *), int S, int N>
    static void loop(const char *v) {
        static_assert(N%(64*16)==0);
        for (int i=0; i<N; i+=64*16) {
            LOAD(v+64* 1-S);
            LOAD(v+64* 2-S);
            LOAD(v+64* 3-S);
            LOAD(v+64* 4-S);
            LOAD(v+64* 5-S);
            LOAD(v+64* 6-S);
            LOAD(v+64* 7-S);
            LOAD(v+64* 8-S);
            LOAD(v+64* 9-S);
            LOAD(v+64*10-S);
            LOAD(v+64*11-S);
            LOAD(v+64*12-S);
            LOAD(v+64*13-S);
            LOAD(v+64*14-S);
            LOAD(v+64*15-S);
            LOAD(v+64*16-S);
            v += 64*16;
        }
    }
};

struct WithoutBoundary64 {
    template <int S>
    static float factor() {
        return (64-S)/64.0f;
    }
    template <void (*LOAD)(const char *), int S, int N>
    static void loop(const char *v) {
        for (int i=0; i<N; i+=S*16) {
            if ((S* 1)&0x3f) LOAD(v+S* 0);
            if ((S* 2)&0x3f) LOAD(v+S* 1);
            if ((S* 3)&0x3f) LOAD(v+S* 2);
            if ((S* 4)&0x3f) LOAD(v+S* 3);
            if ((S* 5)&0x3f) LOAD(v+S* 4);
            if ((S* 6)&0x3f) LOAD(v+S* 5);
            if ((S* 7)&0x3f) LOAD(v+S* 6);
            if ((S* 8)&0x3f) LOAD(v+S* 7);
            if ((S* 9)&0x3f) LOAD(v+S* 8);
            if ((S*10)&0x3f) LOAD(v+S* 9);
            if ((S*11)&0x3f) LOAD(v+S*10);
            if ((S*12)&0x3f) LOAD(v+S*11);
            if ((S*13)&0x3f) LOAD(v+S*12);
            if ((S*14)&0x3f) LOAD(v+S*13);
            if ((S*15)&0x3f) LOAD(v+S*14);
            if ((S*16)&0x3f) LOAD(v+S*15);
            v += S*16;
        }
    }
};

struct JustBoundary4096 {
    template <int S>
    static float factor() {
        return S/4096.0f;
    }
    template <void (*LOAD)(const char *), int S, int N>
    static void loop(const char *v) {
        static_assert(N%(4096*4)==0);
        for (int i=0; i<N; i+=4096*4) {
            LOAD(v+4096*1-S);
            LOAD(v+4096*2-S);
            LOAD(v+4096*3-S);
            LOAD(v+4096*4-S);
            v += 4096*4;
        }
    }
};


long long int t() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (long long int)tv.tv_sec*1000000 + tv.tv_usec;
}

template <typename TYPE, void (*LOADa)(const char *), void (*LOADu)(const char *), int S, int N>
void bench(const char *data, int iter, const char *name) {
    long long int t0 = t();
    for (int i=0; i<iter*100000; i++) {
        TYPE::template loop<LOADa, S, N/100000>(data);
    }
    long long int t1 = t();
    for (int i=0; i<iter*100000; i++) {
        TYPE::template loop<LOADu, S, N/100000>(data+1);
    }
    long long int t2 = t();
    for (int i=0; i<iter; i++) {
        TYPE::template loop<LOADa, S, N>(data);
    }
    long long int t3 = t();
    for (int i=0; i<iter; i++) {
        TYPE::template loop<LOADu, S, N>(data+1);
    }
    long long int t4 = t();

    printf("%s-bit, cache: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3fx\n", name, (double)N*iter/(t1-t0)/1000*TYPE::template factor<S>(), (double)N*iter/(t2-t1)/1000*TYPE::template factor<S>(), (float)(t2-t1)/(t1-t0));
    printf("%s-bit,   mem: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3fx\n", name, (double)N*iter/(t3-t2)/1000*TYPE::template factor<S>(), (double)N*iter/(t4-t3)/1000*TYPE::template factor<S>(), (float)(t4-t3)/(t3-t2));
}

int main() {
    const int ITER = 10;
    const int N = 1638400000;

    char *data = reinterpret_cast<char *>(((reinterpret_cast<unsigned long long>(new char[N+8192])+4095)&~4095));
    for (int i=0; i<N+4096; i++) data[i] = 0;   // N+4096, not N+8192: the 4096-byte align-up can eat up to 4095 bytes of the allocation

    printf("Full:\n");
    bench<Full, load32, load32, 4, N>(data, ITER, " 32");
    bench<Full, load64, load64, 8, N>(data, ITER, " 64");
    bench<Full, load128a, load128u, 16, N>(data, ITER, "128");

    printf("\nJustBoundary64:\n");
    bench<JustBoundary64, load32, load32, 4, N>(data, ITER, " 32");
    bench<JustBoundary64, load64, load64, 8, N>(data, ITER, " 64");
    bench<JustBoundary64, load128a, load128u, 16, N>(data, ITER, "128");

    printf("\nWithoutBoundary64:\n");
    bench<WithoutBoundary64, load32, load32, 4, N>(data, ITER, " 32");
    bench<WithoutBoundary64, load64, load64, 8, N>(data, ITER, " 64");
    bench<WithoutBoundary64, load128a, load128u, 16, N>(data, ITER, "128");

    printf("\nJustBoundary4096:\n");
    bench<JustBoundary4096, load32, load32, 4, N>(data, ITER*10, " 32");
    bench<JustBoundary4096, load64, load64, 8, N>(data, ITER*10, " 64");
    bench<JustBoundary4096, load128a, load128u, 16, N>(data, ITER*10, "128");
}
Holmium answered 17/7, 2017 at 22:47 Comment(5)
Printing the numbers in GB/s without also showing loads per cycle or per second is not that useful, especially for the integer loads. It just makes it harder to compare different sizes. It's well known that you will usually bottleneck on load-port uop throughput, not bandwidth per-se, when hitting in L1.Swatter
You might need a longer warm-up period or something, because your "aligned" numbers are different in different tests. (This is why I like to measure core clock cycles with perf counters, not time or "reference cycles" (which is also just time)).Swatter
@PeterCordes: yes, looking at the numbers, now I know what the bottleneck is here, too. :) I've tried a much longer test (run for 30 minutes), but the aligned numbers still differ. Yes, perf counters are a better method, but I don't know how to access them without an external utility (maybe I'll look into this). I set the cpu frequency to max with cpufreq-set; the numbers I get with gettimeofday are kinda OK for me (less than 1% variance)Holmium
Yeah, perf stat is a lot easier than using a perf-counter library (which I've never bothered with either). That's why I suggested (in my answer) having each invocation of the program do one test, controlled by a command-line arg. So with a small near-constant startup overhead (especially for a static binary), you get easy perf counters. That's what I usually do for microbenchmarks in general, e.g. put a main(){ ... } inside an #ifdef in a .c or .cpp with a function I'm tuning.Swatter
Keep in mind that memory-related tests tend to show a ton more variation than CPU-bound tests. It's pretty easy to get variation of 0.1% or 0.01% on a CPU-bound test, even when measuring it from the outside with perf, once you turn off hyperthreading and turbo - but L3 and memory are a shared resource and I often see 10% variation or more. Even just having a browser open in the background may have a big impact. You may want to just run the test 100 times, at which point the "typical" max values become obvious. Looking at the results graphically often makes the asymptote obvious too.Rufus
