Performance of CPU-only code varies with executable file name (!!!)

Reproducing the issue

Run the following shell script on an Apple Silicon Mac (tried on a MacBook Air 2020 with M1 and a MacBook Pro 16" with M3 Max; both running macOS Sonoma, with Apple clang 15.0):

#!/bin/bash

rm -rf NTRU
git clone https://github.com/vector-polymul-ntru-ntrup/NTRU.git
cd NTRU/ntruhps2048677/aarch64_tmvp
cat << 'EOF' > speed_polymul.c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include "api.h"
#include "params.h"
#include "poly.h"
#include "tmvp.h"
#include "batch_multiplication.h"

#define NTESTS 1000

uint64_t time0, time1;
uint64_t cycles[NTESTS];

#ifdef __APPLE__

#include "m1cycles.h"
#define __AVERAGE__
#define SETUP_COUNTER() {(void)cycles; setup_rdtsc();}
#define CYCLE_TYPE "%lld"
#define GET_TIME rdtsc()

#else

#include "hal.h"
#undef __AVERAGE__
#define __MEDIAN__
#define SETUP_COUNTER() {}
#define CYCLE_TYPE "%ld"
#define GET_TIME hal_get_time()


static int cmp_uint64(const void *a, const void *b){
    const uint64_t va = *((const uint64_t*)a);
    const uint64_t vb = *((const uint64_t*)b);
    /* compare without truncating a 64-bit difference to int */
    return (va > vb) - (va < vb);
}

#endif

#ifdef __AVERAGE__

#define LOOP_INIT(__clock0, __clock1) { \
    __clock0 = GET_TIME; \
}
#define LOOP_TAIL(__f_string, records, __clock0, __clock1) { \
    __clock1 = GET_TIME; \
    printf(__f_string, (__clock1 - __clock0) / NTESTS); \
}
#define BODY_INIT(__clock0, __clock1) {}
#define BODY_TAIL(records, __clock0, __clock1) {}

#elif defined(__MEDIAN__)

#define LOOP_INIT(__clock0, __clock1) {}
#define LOOP_TAIL(__f_string, records, __clock0, __clock1) { \
    qsort(records, NTESTS, sizeof(uint64_t), cmp_uint64); /* nmemb, then element size */ \
    printf(__f_string, records[NTESTS >> 1]); \
}
#define BODY_INIT(__clock0, __clock1) { \
    __clock0 = GET_TIME; \
}
#define BODY_TAIL(records, __clock0, __clock1) { \
    __clock1 = GET_TIME; \
    records[i] = __clock1 - __clock0; \
}

#endif

#define WRAP_FUNC(__f_string, records, __clock0, __clock1, func){ \
    LOOP_INIT(__clock0, __clock1); \
    for(size_t i = 0; i < NTESTS; i++){ \
        BODY_INIT(__clock0, __clock1); \
        func; \
        BODY_TAIL(records, __clock0, __clock1); \
    } \
    LOOP_TAIL(__f_string, records, __clock0, __clock1); \
}

poly src1, src2;
poly des;

int main(int argc, char *argv[])
{
    int iters = 1;

    SETUP_COUNTER();

    if (argc > 1) {
        iters = atoi(argv[1]);
    }

    if (iters == 1) {
        WRAP_FUNC("polymul (Z_q[x] / (x^677 - 1)): " CYCLE_TYPE " \n",
                cycles, time0, time1,
                poly_Rq_mul(&(des), &(src1), &(src2)));
    }

    return 0;
}
EOF

make -j16 speed_polymul

cp speed_polymul abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

echo Type your password for sudo now
sudo echo Password OK

sudo ./speed_polymul > /dev/null # First run is slower (possibly has to do with verifying code signature)
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ > /dev/null # First run is slower (possibly has to do with verifying code signature)

echo Running speed_polymul 5 times
sudo ./speed_polymul
sudo ./speed_polymul
sudo ./speed_polymul
sudo ./speed_polymul
sudo ./speed_polymul

echo Running abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 5 times
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

(I apologize for the need for sudo, but the cycle counters in Apple Silicon CPUs can only be accessed as root. I understand if anyone is paranoid and would rather not run this code.)

What this script does is:

  1. Clones a GitHub repository containing the artifacts of a scientific paper: cryptographic code optimized for ARMv8 CPUs, together with benchmarking code that uses the CPU's cycle counter.
  2. Modifies the benchmarking code in a way that exhibits the issue to be described. The modifications are as follows:
    • restricts the benchmark to a single function, removing the others;
    • moves some of the local variables in main() to the global scope;
    • introduces some logic to read a command-line argument controlling the number of iterations of the benchmark (this argument isn't actually used in the test itself, but the presence of this code appears to be necessary to trigger the issue).
  3. Compiles the benchmark (speed_polymul);
  4. Creates a copy of speed_polymul with a different name (abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ) -- a byte-for-byte identical copy made with the cp command-line utility;
  5. Runs each of the two binaries 5 times in a row.

Result

The relevant part of the output (from step 5 above) is always similar to this:

Running speed_polymul 5 times
polymul (Z_q[x] / (x^677 - 1)): 6881 
polymul (Z_q[x] / (x^677 - 1)): 6916 
polymul (Z_q[x] / (x^677 - 1)): 6864 
polymul (Z_q[x] / (x^677 - 1)): 6883 
polymul (Z_q[x] / (x^677 - 1)): 6881 
Running abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 5 times
polymul (Z_q[x] / (x^677 - 1)): 7179 
polymul (Z_q[x] / (x^677 - 1)): 7186 
polymul (Z_q[x] / (x^677 - 1)): 7177 
polymul (Z_q[x] / (x^677 - 1)): 7195 
polymul (Z_q[x] / (x^677 - 1)): 7179 

There is a clear difference of ~5% in performance, which is fully reproducible for me. Note this is CPU-only code, which doesn't perform any kind of I/O.

It would be natural to think this is related to running the benchmark only once (actually 1000 times -- see NTESTS in speed_polymul.c), but I also have a version of the code that runs the benchmark in a loop (each iteration being an average over 1000 runs, as just stated), and the issue happens just the same when I run it for 100,000 iterations. The version above is the simplest one I could create that reliably reproduces the problem for me.
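For reference, a sketch of what the looping variant's main() might look like -- this is a reconstruction (the outer loop over iters is my assumption), not the exact code I used:

/* Sketch only: a drop-in replacement for main() in the speed_polymul.c shown above. */
int main(int argc, char *argv[])
{
    int iters = 1;

    SETUP_COUNTER();

    if (argc > 1) {
        iters = atoi(argv[1]);
    }

    /* Repeat the whole NTESTS-sample measurement iters times; each printed line
       is still the average (or median) over NTESTS runs. */
    for (int it = 0; it < iters; it++) {
        WRAP_FUNC("polymul (Z_q[x] / (x^677 - 1)): " CYCLE_TYPE " \n",
                  cycles, time0, time1,
                  poly_Rq_mul(&(des), &(src1), &(src2)));
    }

    return 0;
}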

Context

I have a project which generates many executables -- which obviously have distinct file names -- to benchmark identical code with different inputs, for a variable number of iterations passed as a command-line parameter. So I am being bitten by this issue: different file names running exactly the same code, only with different inputs (and the inputs do not matter, since this is constant-time crypto code), produce different cycle counts.

Also, this isn't a case of "maybe the crypto code isn't constant-time after all and execution time varies with the inputs": I can make a bit-for-bit identical copy of the executable under a different name (exactly as the script does), and I still get these unquestionably different cycle counts for each copy.

Questions

  1. What could possibly explain this difference?
  2. How do I get rid of it, getting stable results regardless of file name?

Hypothesis

My hypothesis is that the difference is related to the size of the buffer pointed to by argv[0] in main(), and that this somehow triggers some kind of alignment issue in the variables used by the code. However, this still needs more investigation to pinpoint the exact chain of events leading to the issue.

To that end, I can confirm that argv[0] indeed changes. I have added a printf() statement to print the addresses of all the variables in the code (a sketch of it is shown after the analysis below). These are the outputs when running the code under two different names that produce different results:

&time0 = 0x10071b428 &time1 = 0x10071b430 cycles = 0x100718408 argv = 0x16f703b18 argv[0] = 0x16f703c10 &src1 = 0x10071a8e8 &src2 = 0x10071ae88 &des = 0x10071a348
polymul (Z_q[x] / (x^677 - 1)): 7242
&time0 = 0x102b13428 &time1 = 0x102b13430 cycles = 0x102b10408 argv = 0x16d30bb08 argv[0] = 0x16d30bc08 &src1 = 0x102b128e8 &src2 = 0x102b12e88 &des = 0x102b12348
polymul (Z_q[x] / (x^677 - 1)): 6878 

Looking at the addresses modulo 256 (i.e. the last two nibbles in each address), the only difference is in argv and argv[0]. However, it is unclear why this would affect either the function being benchmarked (which only depends on src1, src2 and des) or the code that measures clock cycles (which depends on time0, time1 and cycles, but not argv or argv[0]).
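For reference, the diagnostic that produced the address lines above can be as simple as one printf() near the top of main() -- the exact format string here is approximate, not my code verbatim:

/* Sketch of the address-printing diagnostic (approximate format string). */
printf("&time0 = %p &time1 = %p cycles = %p argv = %p argv[0] = %p "
       "&src1 = %p &src2 = %p &des = %p\n",
       (void *)&time0, (void *)&time1, (void *)cycles, (void *)argv,
       (void *)argv[0], (void *)&src1, (void *)&src2, (void *)&des);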

Belt answered 24/12, 2023 at 19:24 Comment(11)
If you reverse the order of the tests, does the slowness still go with the longer filename? How about with a much longer filename? -- Absorbefacient
Yes, argv[0] would indeed be different. -- Systematic
"read a command-line argument" That'll do it. You could try larger and larger filenames and see whether performance scales with the length. You could try passing args via environment variables, which might be simpler, or ensure all filename lengths are the same. However, benchmark program startup cost should not noticeably affect your benchmarks; you may need to rethink your setup. The benchmark should run long enough that such tiny fluctuations don't matter, and do the timing after startup and before shutdown. -- Sniper
@AndrewMorton the slowness is always tied to the filename, regardless of the order of execution. Trying many different-length file names, the performance flip-flops between the two: 1-5 letters fast, 6-10 letters slow, 11-14 letters fast, 15-21 letters slow, 22-26 letters fast (test stopped at 26 letters). This is again reproducible. -- Belt
@Systematic yes, argv[0] is different. What isn't immediately clear is why it would affect the performance of the rest of the code. -- Belt
@Sniper it's not about startup cost. See the following part of the question: "I also have a version of the code which runs it in a loop (each one an average of 1000 runs, as just stated), and the issue equally happens if I run it for 100,000 iterations." That runs for almost a minute, and the same problem persists from the 1st iteration all the way to the 100,000th. I presume I could leave it running for a day or a month and it would keep exhibiting the issue. -- Belt
@Belt If you assign the filename to a variable and use that instead of the literal, does it change the timings? -- Absorbefacient
@AndrewMorton I'm not sure I follow. You mean in the shell script? Say NAME=speed_polymul, then sudo ./${NAME}, and the same for the longer name? If that's the case, then I just tried it and it made no difference. -- Belt
I have seen something similar on Intel CPUs, where the loop code under test almost exactly fits into one or two cache lines and the exact alignment of the branch and its target destination relative to the cache line boundaries makes a serious difference to execution times. It manifests as sudden changes in benchmark times for unchanged executable code that has shifted in location due to code modifications elsewhere (or compiler-generated SIMD objects badly aligned on the stack due to command-line parameters). I'm still trying to pin down a similar problem on x86 CPUs. So far an MRE eludes me. -- Bruch
Could you try altering argv[0] as it's passed to the target executables, just to be sure? You may be able to use exec -a fakeName ./executable; otherwise just create a small wrapper program that uses execl / execv for the same. -- Tucci
@DillonDavis just tried that, using two different fakeNames (of different sizes), and the issue manifests: one is faster, the other slower. -- Belt

Your hypothesis (that this is triggered by buffer alignment) sounds plausible. But I think your expectations are too high: a systematic difference of 5% in runtime can be caused by all sorts of things that are not under your control.

It's a bit old, but have a look at the Stabilizer slides (or this review of their paper). They specifically mention argc/argv as affecting the stack layout. And since you benchmark to make your code faster, you will be comparing different compilations. You should expect that any unrelated change -- in your code, the OS, the compiler, the CPU, or a library -- can reverse a 5% difference that you measured earlier.

From the review cited above:

The SPECint 2000 “perl” benchmark (which is just the Perl interpreter running some – completely deterministic! – code) is infamous for having +-5% swings (this happens for multiple architectures) whenever anything in the executable changes.

Update: a more recent article, Performance Roulette, has a deep dive into how function alignment can cause this.


Your second question, how to get stable results: I think Stabilizer got that right; you have to randomize your benchmark and average over multiple runs. Try to trigger the effect some other way, e.g. by calling alloca() or malloc() in main() to randomize the stack/heap position. Then average it out, so that the executable name no longer matters.
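A minimal sketch of that alloca() idea (not Stabilizer itself; the offset range of 16..2048 bytes is an arbitrary choice, just to illustrate):

#include <alloca.h>   /* alloca() on macOS/Linux; it may also come in via stdlib.h */
#include <stdlib.h>
#include <time.h>

volatile char *stack_pad;   /* volatile access so the compiler can't drop the alloca */

int main(void)
{
    srand((unsigned)time(NULL));

    /* Shift the stack pointer by a random multiple of 16 bytes (16..2048), so that
       every frame below main() lands at a different alignment on each run;
       averaging over many runs then washes the alignment effect out. */
    stack_pad = alloca(16 * (size_t)(1 + rand() % 128));
    stack_pad[0] = 0;

    /* ... set up the cycle counter and run the benchmark here ... */
    return 0;
}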

There is also Coz, the causal profiler, by the same authors as Stabilizer. I haven't used it myself, but it seems relevant. (Update: On second glance, Coz is about multi-threading, it doesn't seem to do anything about stack layout, so it wouldn't have helped.)

Alternatively, you could take an engineering stance: whenever you end up with a 5% difference, call it within tolerance and pick the more readable version.

Fundy answered 25/12, 2023 at 8:4 Comment(0)

Adding to @maxy's excellent answer, here are the results of a few more investigations that are closing in on the problem.

First, a confirmation that argv[0] is stored on the stack: printing the address of the iters local variable as well as the result of __builtin_frame_address(0) shows that they are all less than 1 KB apart (the diagnostic is sketched below, after the analysis). For the slow version we have:

argv = 0x16f8c7ac8 argv[0] = 0x16f8c7bd0 &iters = 0x16f8c783c sp = 0x16f8c7870

And for the fast version:

argv = 0x16d8dfa50 argv[0] = 0x16d8dfb80 &iters = 0x16d8df7cc sp = 0x16d8df800

Note that the stack pointer is aligned to a nice boundary (2048 bytes) in the fast case, versus only a 16-byte boundary in the slow case.
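For reference, the diagnostic behind these two lines is essentially one printf() inside main() -- sketched here, with an approximate format string:

/* Prints argv, argv[0], the address of a local, and the frame address (labelled "sp" above). */
printf("argv = %p argv[0] = %p &iters = %p sp = %p\n",
       (void *)argv, (void *)argv[0], (void *)&iters,
       __builtin_frame_address(0));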

Apparently all that would be needed is to align the stack pointer to a larger boundary (perhaps 64 or even 32 would be enough), and there is a nice clang option for this: -mstack-alignment (see also -mstackrealign).
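A sketch of how one might try to wire this in (the alignment value of 64 is a guess of mine, and I am assuming the option takes the alignment in bytes):

# Added to the existing CFLAGS in NTRU/ntruhps2048677/aarch64_tmvp/Makefile
CFLAGS += -mstack-alignment=64 -mstackrealign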

Unfortunately, for some reason, adding these options to CFLAGS in the Makefile appears to have no effect. @maxy's idea to add a call to alloca() in main() isn't helping either. I will continue investigating and see if I can find a reliable way to influence stack alignment. EDIT: I have created a separate question about the -mstack-alignment issue. Feel free to address this issue over there if you have any idea what's going on.

Belt answered 25/12, 2023 at 14:33 Comment(0)
