Why do programs run slower on first run after compile?
I have this simple hello world program:

#include <stdio.h>

int main() {
  printf("Hello, world!\n");
  return 0;
}

I compiled this program with LLVM Clang (v15.0.1, built from Homebrew, so not Apple's version) as usual, then ran it and timed each execution. To my surprise, the first run took nearly 10x longer than the second, and the following executions ran much faster still.

$ clang test.c -o test
$ time ./test
Hello, world!

real    0m0.169s
user    0m0.001s
sys 0m0.002s
$ time ./test
Hello, world!

real    0m0.017s
user    0m0.001s
sys 0m0.006s
$ time ./test
Hello, world!

real    0m0.004s
user    0m0.001s
sys 0m0.002s
$ time ./test
Hello, world!

real    0m0.008s
user    0m0.001s
sys 0m0.005s

I'm running this on an Intel Core i5 Mac running macOS Big Sur v11.6.8. The shell is the bash shipped with macOS.

Nothing in my code deals with time, and I don't think there's anything to cache, so I'm not sure why the first execution runs so slowly. I suspect the OS might be doing some kind of optimization, but I don't know what or how. What is the cause of this large discrepancy in runtimes?

Equipment answered 28/1, 2023 at 1:33 Comment(27)
The CPU warms up. Caches too. You don't want a nuclear plant running at max output when no one is using it. Same thing with modern CPUs: they run at a low speed until you challenge them, and then they speed up (if that's how you've configured them).Storied
@TedLyngmo I thought that might be the case, but even after 10 minutes the program still runs faster than that first post-compile run. It takes more than 10ms, but still nowhere near 169ms.Equipment
That's not a contradiction. Quite the opposite. The first time the program warmed up the system. Now you're running on a hot system.Storied
Try echo 3 | sudo tee /proc/sys/vm/drop_caches before next run.Purnell
I've seen the same effect on Linux. Perhaps the timing code itself had to get paged in from disk (it's a bash/zsh built-in, but demand-paging of executables works in 4k chunks). And/or the dynamic library code that runs as part of your binary. Try timing something else before the first timing of the newly-compiled binary, e.g. time /bin/true.Rocambole
CPU-frequency warmup doesn't explain it, @TedLyngmo. The half second between up-arrow and return to run it again is plenty of time for that CPU core to go back into a deep sleep state (and power down its caches). And the binary file containing the executable was just written so it's hot in DRAM (in the pagecache, ready to get mapped into memory). The linker won't have used direct IO to bypass the kernel's buffering/caching. But yes, DRAM cache of other files like the dynamic linker makes sense, not the executable itself. Except clang just ran, so not the dynamic linker or libs.Rocambole
@PeterCordes "CPU-frequency warmup doesn't explain it" - I'm 100% sure you're right. The speed and heat correlate (for me) well with what I've seen when I've done similar things, and my conclusion was based on that observation alone. I haven't searched for other factors (no need; my gf reports when the fans start to become a nuisance).Storied
It’s the dynamic linker loading everything in the first run, and running from cache after that.Thunderous
@TedLyngmo: CPU frequency decisions are made every 1 to 10 milliseconds, or less with hardware controllers. Running clang used way more CPU time than time ./test does, so if "recent" load more than a second ago was relevant, the first run should be the fastest. (Unless it takes so much longer to type time ./test than to up-arrow it...) With a software governor, a process using up its timeslice will typically trigger a jump to max frequency, but that'll age away very soon.Rocambole
maybe it's because of caching?Pistole
I always assumed it was just the MS antivirus checking out the executable the first time it was executed...Phlebosclerosis
@PeterCordes I'm not sure my takeaway from this is correct but "the first run should be the fastest" is quite the opposite from what I've seen on my very old rig (a ~15 year old Xeon).Storied
@TedLyngmo: Exactly, that's another way we can rule out the CPU-frequency warmup hypothesis, because it predicts the opposite of what happens. I also usually observe the same thing, that the first run of a newly-linked executable is slower, on Linux on an i7-6700k Skylake, running from tmpfs. If you put multiple runs truly back-to-back, like time ./test; time ./test on the same command line, or a shell loop, you could have an effect, but other effects are clearly more significant.Rocambole
@PeterCordes Oh.... I've monitored the CPU cores and run tests in random order when I want to compare. I also fixed the CPU cores at max speed (at one point) to rule out spin-up/down. This is when I've bombarded all cores with tasks, and the logs of CPU speed and of how quickly things get done have always seemed to correlate. (Sidenote: I got "CPU throttling" and then core after core winding down, before Dell released a microcode update).Storied
@RichardBarber Do you have any references in regard to caching of dynamically linked objects? I assume anything could be preloaded and the symbols would have to be relinked.Keep
@TedLyngmo: Yeah, under sustained load, the CPU frequency jumps up and then things run faster. But even 10 ms of no load is enough to drop back to idle, so this performance difference of the first run of a fresh executable on a hopefully-idle system isn't explainable by CPU frequency warm-up. Given the OP's report of a persistent effect even after 10 minutes, it could be some kind of caching, or as Dúthomhas suggested, a virus scan or other extra security checks the first time a newly-written file is executed. (Unlikely to be MS, though, on MacOS!).Rocambole
@TedLyngmo: On Linux with software CPU-frequency management, Why does this delay-loop start to run faster after several iterations with no sleep? shows that a process not using up a full scheduler timeslice doesn't keep the CPU frequency up. IDK if MacOS would use a similar algorithm or not.Rocambole
@PeterCordes I can fit that into what I saw back when I tuned it. My server is now really slow to react. Give it a second and then, "oh, work? fun" :-) I've left my desktop mostly "as delivered" and haven't had to do anything but to swap using the cabinet sensor as trigger for the fans. When the GPU starts, the CMOS sensor is closer to the action and then can trigger the fan to get the air pumping out a lot faster. Seems to keep the system a little happier.Storied
Looks like this is a dup. Consensus seems to be: disk caching. (And maybe AV.)Phlebosclerosis
@Dúthomhas: That question talks about sorting "a file of integers", which wouldn't get read during compilation so could be cold, unlike this which doesn't intentionally read other files. And talking about how it gets slow again after restarting the PC, so again clearly disk caching. That doesn't explain this; the binary itself is freshly written; still hot in the pagecache. (Maybe even still dirty; I don't think an OS would need to sync it to disk before mapping it into a process, but perhaps it does so anyway? Although I think I've seen this on Linux in tmpfs so that wouldn't explain it.)Rocambole
@Dúthomhas: So yes that's related, but this question is more specific; if it is disk caching, a good answer should identify exactly what file is cold, and show that warming it up via some method other than running this program (e.g. running another program?) makes the first run fast.Rocambole
@MichaelM. Could you try sync && sudo purge to eliminate disk caching?Keep
@MossRichardson The code runs in 0.087 seconds, much slower than runs after it, but about twice as fast as the post-compile run. It looks like disk caching is part of it, but I think there's a bit more to it.Equipment
Some good suggestions here. In my limited understanding of CPU/cache architecture, I'd have just assumed it was a result of caching. Probably not L1, but more likely L2 or L3 cache.Encase
None of these comments are remotely relevant. You're clearly running OSX, and the first execution of a binary runs up against the 'System Integrity Protection' (SIP) infrastructure. Hot caching may explain subsequent speedups to a point. It's possible to disable SIP. Don't.Unmanned
I took a hello world written in AArch64 assembly, compiled it with clang on macOS, ran it and left it alone for a few days. Now when I execute it with time ./helloworld, the first run takes real 0m0.051s and subsequent runs real 0m0.006s to real 0m0.009s. I'm on this question because I've noticed this too for a while.Elflock
I would suggest trying to run the program from an in-memory filesystem rather than disk to see whether it is because of disk caching.Finnish

There is another factor that could contribute to the effect you observed: the CPU cache. The first time the program runs, attempts to fetch its instructions and data from the L1, L2, and L3 caches produce cache misses, so the CPU then has to fetch them from RAM (or, via page faults, ultimately from disk), both of which are much slower than the CPU caches. During that first run, however, the caches are populated with the program's instructions and data, so on subsequent runs the CPU is more likely to find them in one of those caches, and the program is likely to run faster as a result.

Blackthorn answered 8/6 at 18:53 Comment(8)
That was my first thought, but I don't think it's correct. The OP manually used up-arrow and return in the shell to do the second run, which was probably enough time for the OS to put the CPU into a deep sleep, flushing and powering down the per-core L1 and L2 caches, and maybe even L3 if all cores were in sleep states. (The OP says they have an Intel i5 CPU, which does work that way.)Rocambole
Also, the file was just written by the compiler so it should still be hot in L3 cache from the write system call by the linker that copied data from its buffer into the pagecache. Unless the kernel uses a cache-bypassing copy-from-user function like with NT stores.Rocambole
On the first point, I think the important cache in this case would be the L3 cache, since the probability of a program running on the same core in a subsequent execution is low, given the number of cores in the i5 chips. My sense of it is that the L1 and L2 caches are most useful for caching instructions and data in program loops, while the much larger L3 cache is for frequently run programs.Blackthorn
To the second point, unlike the page table, CPU cache isn't affected by the compiler writing a program to disk. Instructions and data are written to CPU cache only during the execution of a program.Blackthorn
The compiler/linker don't directly write the file to disk, they write it to the filesystem. Which is cached in memory, in the OS's pagecache. Pages in the pagecache are dirty until they're synced to disk (after some timeout or demand for pages). A write system call effectively does a memcpy inside the kernel, from user-space to the pagecache. Executing the newly-written file maps that same page into the address-space of the process that did the execve. So unless the memcpy's stores bypassed cache, the data is still potentially hot in L3 cache from being written recently.Rocambole
As I commented under the question last year, I reproduced the same effect on Linux on my own Skylake i7 system, and I know that the Linux kernel's copy_from_user uses rep movsb which does populate L3 cache.Rocambole
I agree that L3 cache is likely more important than L1 and L2. One thought I had is that the OS has to allocate memory for page tables and per-process bookkeeping stuff. And if the last thing that happened on the system was that same program exiting, the allocators might reuse the same physical memory, so the OS gets L3 hits when creating its internal data structures for the process.Rocambole
And/or the OS's allocator is faster because it can find blocks of the exact size it wants at the head of the free list? And other similar effects in terms of the OS's data structures, not just hardware caches.Rocambole

Compiled languages, such as C, C++, and Rust, typically involve a two-step process: compilation and execution.

First run (compilation and execution):

- Compilation: the source code is translated into machine code (a binary file) by a compiler. This process can be time-consuming, depending on the size and complexity of the code.
- Execution: once compiled, the machine code is executed by the CPU.

Subsequent runs (execution only):

- No need for recompilation: if the source code has not changed, the compiler does not need to recompile it; the already-compiled binary file is executed directly.
- Execution: since the compilation step is skipped, only the execution step occurs, making it significantly faster than the first run.

Nivernais answered 5/8 at 8:29 Comment(1)
The OP isn't including the compile time for the first run. Just the actual run. Compilation is done manually, not lazily just-in-time for the languages you mention. You can see this in the terminal session in the question; note that clang is done compiling before the time ./test.Rocambole

The behavior you are observing is likely due to the way modern operating systems handle the first execution of a binary, often called "file system caching" or "disk caching." When you execute a program for the first time, the operating system loads it from disk into memory, and subsequent executions are much faster because the OS keeps a cached copy of the program in memory.

Here's what typically happens during the first and subsequent runs:

First execution:

The operating system loads the executable "test" from disk into memory, the program is executed, and the output is displayed. Since it's the first time running the program, the OS doesn't have a cached copy, and reading the program from disk and loading it into memory takes time, which leads to a longer execution time.

Subsequent executions:

The operating system already has a cached copy of the "test" program in memory from the first execution. The cached copy is used directly, eliminating the need to read from disk and resulting in much faster execution times. As you've observed, subsequent executions are significantly faster than the first one. The time difference you see between the first and second execution is likely due to the disk cache and memory cache behavior of the operating system. On subsequent runs, the program's binary and library dependencies may still reside in memory, which makes execution much faster.

Additionally, the timing differences you see can also be affected by other background processes running on your system during different executions.

This behavior is not specific to macOS; it's a common feature of modern operating systems to improve the overall performance of frequently executed programs.

If you want to get a more consistent benchmark of your program's execution time, you can run it multiple times and take the average time from those runs after the initial cache warm-up.

Villon answered 4/8, 2023 at 21:45 Comment(1)
On the first run after a compile, the file was just written by the compiler so it's already hot in RAM from that. This was discussed in comments when the question was first posted. Guesses included the timing code or libraries needing to get paged in, but a C compiler probably warmed up the pages containing the libc functions this simple C program uses.Rocambole
