Reproducing the issue
Run the following shell script on an Apple Silicon Mac (tried on a MacBook Air 2020 with M1 and a MacBook Pro 16" with M3 Max; both running macOS Sonoma, with Apple clang 15.0):
#!/bin/bash
rm -rf NTRU
git clone https://github.com/vector-polymul-ntru-ntrup/NTRU.git
cd NTRU/ntruhps2048677/aarch64_tmvp
cat << EOF > speed_polymul.c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include "api.h"
#include "params.h"
#include "poly.h"
#include "tmvp.h"
#include "batch_multiplication.h"
#define NTESTS 1000
uint64_t time0, time1;
uint64_t cycles[NTESTS];
#ifdef __APPLE__
#include "m1cycles.h"
#define __AVERAGE__
#define SETUP_COUNTER() {(void)cycles; setup_rdtsc();}
#define CYCLE_TYPE "%lld"
#define GET_TIME rdtsc()
#else
#include "hal.h"
#undef __AVERAGE__
#define __MEDIAN__
#define SETUP_COUNTER() {}
#define CYCLE_TYPE "%ld"
#define GET_TIME hal_get_time()
static int cmp_uint64(const void *a, const void *b){
return ((*((const uint64_t*)a)) - ((*((const uint64_t*)b))));
}
#endif
#ifdef __AVERAGE__
#define LOOP_INIT(__clock0, __clock1) { \
__clock0 = GET_TIME; \
}
#define LOOP_TAIL(__f_string, records, __clock0, __clock1) { \
__clock1 = GET_TIME; \
printf(__f_string, (__clock1 - __clock0) / NTESTS); \
}
#define BODY_INIT(__clock0, __clock1) {}
#define BODY_TAIL(records, __clock0, __clock1) {}
#elif defined(__MEDIAN__)
#define LOOP_INIT(__clock0, __clock1) {}
#define LOOP_TAIL(__f_string, records, __clock0, __clock1) { \
qsort(records, sizeof(uint64_t), NTESTS, cmp_uint64); \
printf(__f_string, records[NTESTS >> 1]); \
}
#define BODY_INIT(__clock0, __clock1) { \
__clock0 = GET_TIME; \
}
#define BODY_TAIL(records, __clock0, __clock1) { \
__clock1 = GET_TIME; \
records[i] = __clock1 - __clock0; \
}
#endif
#define WRAP_FUNC(__f_string, records, __clock0, __clock1, func){ \
LOOP_INIT(__clock0, __clock1); \
for(size_t i = 0; i < NTESTS; i++){ \
BODY_INIT(__clock0, __clock1); \
func; \
BODY_TAIL(records, __clock0, __clock1); \
} \
LOOP_TAIL(__f_string, records, __clock0, __clock1); \
}
poly src1, src2;
poly des;
int main(int argc, char *argv[])
{
int iters = 1;
SETUP_COUNTER();
if (argc > 1) {
iters = atoi(argv[1]);
}
if (iters == 1) {
WRAP_FUNC("polymul (Z_q[x] / (x^677 - 1)): " CYCLE_TYPE " \n",
cycles, time0, time1,
poly_Rq_mul(&(des), &(src1), &(src2)));
}
return 0;
}
EOF
make -j16 speed_polymul
cp speed_polymul abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
echo Type your password for sudo now
sudo echo Password OK
sudo ./speed_polymul > /dev/null # First run is slower (possibly has to do with verifying code signature)
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ > /dev/null # First run is slower (possibly has to do with verifying code signature)
echo Running speed_polymul 5 times
sudo ./speed_polymul
sudo ./speed_polymul
sudo ./speed_polymul
sudo ./speed_polymul
sudo ./speed_polymul
echo Running abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 5 times
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
(I apologize for the need for sudo
, but the cycle counters in Apple Silicon CPUs can only be accessed as root. I understand if anyone is paranoid and would rather not run this code.)
What this script does is:
- Clones a GitHub repository containing artifacts from a scientific paper which implements cryptographic code optimized for ARMv8 CPUs, and associated benchmarking code using the CPU's cycle counter.
- Modifies the benchmarking code in a way that exhibits the issue to be described. The modifications are as follows:
- restricts the benchmark to a single function, removing the others;
- move some of the local variables in
main()
to the global scope; - introduces some logic to read a command-line argument to control the number of iterations of the benchmark (although this isn't actually used in the test itself; however, the presence of this code appears necessary to trigger the issue).
- Compiles the benchmark (
speed_polymul
); - Creates a copy of
speed_polymul
with a different name (abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
) -- this is an identical byte-per-byte copy using thecp
command-line utility - Runs each of the binaries 5 times in a row
Result
The relevant part of the output (step 5) above is always similar to this:
Running speed_polymul 5 times
polymul (Z_q[x] / (x^677 - 1)): 6881
polymul (Z_q[x] / (x^677 - 1)): 6916
polymul (Z_q[x] / (x^677 - 1)): 6864
polymul (Z_q[x] / (x^677 - 1)): 6883
polymul (Z_q[x] / (x^677 - 1)): 6881
Running abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 5 times
polymul (Z_q[x] / (x^677 - 1)): 7179
polymul (Z_q[x] / (x^677 - 1)): 7186
polymul (Z_q[x] / (x^677 - 1)): 7177
polymul (Z_q[x] / (x^677 - 1)): 7195
polymul (Z_q[x] / (x^677 - 1)): 7179
There is a clear difference of ~5% in performance, which is fully reproducible for me. Note this is CPU-only code, which doesn't perform any kind of I/O.
It would be natural to think this is related to running the benchmark only once (actually 1000 times -- see NTESTS
in speed_polymul.c
), but I also have a version of the code which runs it in a loop (each one an average of 1000 runs, as just stated), and the issue equally happens if I run it for 100,000 iterations. The version above is the simplest one I could create that reliably reproduces the problem for me.
Context
I have a project which generates many executables -- which, obviously have distinct file names -- to benchmark identical code with different inputs, for a variable number of iterations, passed as a command-line parameter. So I am being bitten by this issue, as different file names running exactly the same code, only for different inputs (and the inputs do not matter, as this is constant-time crypto code), produce different cycle counts.
Also this isn't an issue of "maybe the crypto code isn't constant time after all and execution time varies with the inputs", since I can just make a bit-for-bit identical copy of the executable with a different name (exactly as the script does), and I get these unquestionably different cycle counts in each case.
Questions
- What could possibly explain this difference?
- How do I get rid of it, getting stable results regardless of file name?
Hypothesis
My hypothesis for the difference is that this is related to the size of the buffer pointed to by argv[0]
in main()
, and that this is somehow triggering some kind of alignment issue on the variables used by the code. However, this still needs more investigation to pinpoint the exact chain of events leading to this issue.
To that end, I can confirm that argv[0]
indeed changes. I have added a printf()
statement to print the addresses of all variables in the code. These are the outputs when running the code with two different names that produce different results:
&time0 = 0x10071b428 &time1 = 0x10071b430 cycles = 0x100718408 argv = 0x16f703b18 argv[0] = 0x16f703c10 &src1 = 0x10071a8e8 &src2 = 0x10071ae88 &des = 0x10071a348
polymul (Z_q[x] / (x^677 - 1)): 7242
&time0 = 0x102b13428 &time1 = 0x102b13430 cycles = 0x102b10408 argv = 0x16d30bb08 argv[0] = 0x16d30bc08 &src1 = 0x102b128e8 &src2 = 0x102b12e88 &des = 0x102b12348
polymul (Z_q[x] / (x^677 - 1)): 6878
Looking at the addresses modulo 256 (i.e. the last two nibbles in each address), the only difference is in argv
and argv[0]
. However, it is unclear why this would affect either the function being benchmarked (which only depends on src1
, src2
and des
) or the code that measures clock cycles (which depends on time0
, time1
and cycles
, but not argv
or argv[0]
).
argv[0]
is different. What isn't immediately clear is why it would affect the performance of the rest of the code. – BeltNAME=speed_polymul
thensudo ./${NAME}
, and the same for the longer name? If that's the case, then I just tried it and it made no difference. – Beltargv[0]
as its passed to the target executables, just to be sure? You may be able to useexec -a fakeName ./executable
, otherwise just create a small wrapper program that usesexecl
/execv
for the same. – TuccifakeName
s (of different sizes) and the issue manifests: one is faster, the other slower. – Belt