Fastest Linux system call

On an x86-64 Intel system that supports syscall and sysret, what's the "fastest" system call from 64-bit user code on a vanilla kernel?

In particular, it must be a system call that exercises the syscall/sysret user <-> kernel transition1, but does the least amount of work beyond that. It doesn't even need to do the syscall itself: some type of early error which never dispatches to the specific call on the kernel side is fine, as long as it doesn't go down some slow path because of that.

Such a call could be used to estimate the raw syscall and sysret overhead independent of any work done by the call.
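
To be concrete, the kind of harness I have in mind is just a timing loop like the sketch below (SYS_getuid is only a placeholder for whichever call turns out to be cheapest, and rdtsc counts reference cycles rather than core cycles):

    /* Rough sketch of the measurement: time a tight loop of N system calls
     * with rdtsc and divide. SYS_getuid is just a placeholder; build with
     * gcc -O2 bench.c */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <x86intrin.h>              /* __rdtsc() */

    int main(void)
    {
        const long N = 1000000;
        unsigned long long t0 = __rdtsc();
        for (long i = 0; i < N; i++)
            syscall(SYS_getuid);        /* candidate cheap system call */
        unsigned long long t1 = __rdtsc();
        printf("~%llu reference cycles per call\n", (t1 - t0) / N);
        return 0;
    }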


1 In particular, this excludes things that appear to be system calls but are implemented in the VDSO (e.g., clock_gettime) or are cached by the runtime (e.g., getpid).

Enjoy asked 21/2, 2018 at 18:34 Comment(8)
Why do you ask, and why do you care? Your question lacks a lot of motivation!Yaron
Why do you care why I care? Personally I don't adhere to the idea that any question needs to have a detailed motivation, as long as it is clear enough - it's an annoying aspect of SO that a particular subgroup answers almost every question with "Why would you care? XY Problem, etc". In any case, despite my feelings on the matter, I even included the motivation upfront since I figured someone would ask: Such a call could be used to estimate the raw sysenter and sysret overhead independent of any work done by the call.Enjoy
See the conversation immediately surrounding this post, for example. @BasileStarynkevitchEnjoy
Are you excluding the possibility of a do nothing system call created by a developer and added to the kernel?Roughspoken
I see you have responded in a comment about an unmodified kernel. That really should be captured in the question.Roughspoken
@MichaelPetch - yes, it should be the fastest existing call on modern kernels. In any case, I suspect a do-nothing call isn't the fastest anyway: the fastest is probably just an error path, e.g., "too high syscall number", which never even leaves the entry code. To be fair, I think "adding your own system call" is implicitly excluded on Linux questions unless something indicates otherwise. Otherwise, any "How can I do X on Linux" question could simply be answered by "add your own syscall to do it" (and then try to convince everyone to use your custom kernel?).Enjoy
@MichaelPetch - I added "on a vanilla kernel". That's not really well defined, but I think it gets the idea across - if you have a better way of wording it, let me know.Enjoy
Related: FlexSC: Flexible System Call Scheduling with Exception-Less System Calls is a paper that has some measurements/simulations of lowered IPC after a system-call returns.Nascent

One that doesn't exist, and therefore returns -ENOSYS quickly.

From arch/x86/entry/entry_64.S:

#if __SYSCALL_MASK == ~0
    cmpq    $__NR_syscall_max, %rax
#else
    andl    $__SYSCALL_MASK, %eax
    cmpl    $__NR_syscall_max, %eax
#endif
    ja  1f              /* return -ENOSYS (already in pt_regs->ax) */
    movq    %r10, %rcx

    /*
     * This call instruction is handled specially in stub_ptregs_64.
     * It might end up jumping to the slow path.  If it jumps, RAX
     * and all argument registers are clobbered.
     */
#ifdef CONFIG_RETPOLINE
    movq    sys_call_table(, %rax, 8), %rax
    call    __x86_indirect_thunk_rax
#else
    call    *sys_call_table(, %rax, 8)
#endif
.Lentry_SYSCALL_64_after_fastpath_call:

    movq    %rax, RAX(%rsp)
1:
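
As an illustration (a sketch, not part of the kernel excerpt above, and the helper name is my own): issuing a raw syscall instruction with an arbitrary out-of-range number takes exactly that ja 1f early return and comes back with -ENOSYS in rax, with no glibc errno handling in the way.

    /* Sketch: raw syscall with an out-of-range number so the range check
     * above takes the "ja 1f" early return. 0x12345678 is arbitrary, just
     * well above __NR_syscall_max. Expect -38 (-ENOSYS) back in rax. */
    #include <stdio.h>

    static long bogus_syscall(long nr)
    {
        long ret;
        asm volatile ("syscall"
                      : "=a" (ret)
                      : "a" (nr)
                      : "rcx", "r11", "memory");   /* syscall clobbers rcx/r11 */
        return ret;
    }

    int main(void)
    {
        printf("returned %ld\n", bogus_syscall(0x12345678));
        return 0;
    }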
Myatt answered 21/2, 2018 at 19:19 Comment(6)
That's the entry point for syscall from 64-bit mode, not for sysenter. There's somewhat more overhead for compat syscalls (syscall number only checked from C code), and the entry points are in entry_64_compat.S. But yes, an out-of-range syscall number appears to be the fastest.Nascent
@PeterCordes and Tim - sorry for the confusion. The whole time I should have been talking about syscall (i.e., the best way to make system calls in 64-bit code and the method that should be in the VDSO thunk), but I mistakenly wrote sysenter/sysret in the question (the sysret being correct at least). So given that, is this the right entry point?Enjoy
@Enjoy yes, it is the right entry point. But note that with Meltdown mitigation enabled, the actual entry point is entry_SYSCALL_64_trampoline, so the kernel can avoid exposing the kernel-ASLR offset via the IDT (which has to be mapped even in user-space, and thus can be read by Meltdown)Nascent
@peter do you know if pti=off changes that trampoline behavior?Enjoy
@BeeOnRope: probably? I think it could, but IDK if it does. Can you find out with sudo perf record to see which instructions get counts inside the kernel?Nascent
This is the right answer. I had earlier reported it to be slower than simple syscalls like getuid but this was wrong: I was only measuring user time, not kernel time (the difference then was down to error handling as discussed here). Once corrected, this is the fastest, at ~117 cycles versus ~130 cycles for the fastest system calls, once you disable all the Meltdown and Spectre stuff and call it directly from asm using syscall.Enjoy

Use an invalid system call number so the dispatching code simply returns with
eax = -ENOSYS instead of dispatching to a system-call handling function at all.

Unless this causes the kernel to use the iret slow path instead of sysret / sysexit. That might explain the measurements showing an invalid number being 17 cycles slower than syscall(SYS_getpid), because glibc error handling (setting errno) probably doesn't explain it. But from my reading of the kernel source, I don't see any reason why it wouldn't still use sysret while returning -ENOSYS.


This answer is for sysenter, not syscall. The question originally said sysenter / sysret (which was weird because sysexit goes with sysenter, while sysret goes with syscall). I answered based on sysenter for a 32-bit process on an x86-64 kernel.

Native 64-bit syscall is handled more efficiently inside the kernel. (Update: with Meltdown / Spectre mitigation patches, it still dispatches via C do_syscall_64 in 4.16-rc2.)


My Q&A "What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?" gives an overview of the kernel side of system-call entry points from compat mode into an x86-64 kernel (entry_64_compat.S). This answer just pulls out the relevant parts of that.

The links in that answer and this one are to Linux 4.12 sources, which don't contain the Meltdown-mitigation page-table manipulation, so that will be significant extra overhead.

int 0x80 and sysenter have different entry points. You're looking for entry_SYSENTER_compat. AFAIK, sysenter always goes there, even if you execute it in a 64-bit user-space process. Linux's entry point pushes a constant __USER32_CS as the saved CS value, so it will always return to user-space in 32-bit mode.

After pushing registers to construct a struct pt_regs on the kernel stack, there's a TRACE_IRQS_OFF hook (no idea how many instructions that amounts to), then a call to do_fast_syscall_32, which is written in C. (Native 64-bit syscall dispatching is done directly from asm, but 32-bit compat system calls are always dispatched through C.)

do_syscall_32_irqs_on in arch/x86/entry/common.c is pretty light-weight: just a check if the process is being traced (I think this is how strace can hook system calls via ptrace), then

   ...
    if (likely(nr < IA32_NR_syscalls)) {
        regs->ax = ia32_sys_call_table[nr]( ... arg );
    }

    syscall_return_slowpath(regs);
}

AFAIK, the kernel can use sysexit after this function returns.

So the return path is the same whether or not EAX had a valid system call number, and obviously returning without dispatching at all is the fastest path through that function, especially in a kernel with Spectre mitigation where the indirect branch on the table of function pointers would go through a retpoline and always mispredict.

If you really want to test sysenter/sysexit without all that extra overhead, you'll need to modify Linux to install a much simpler entry point that skips the tracing check and doesn't push / pop all the registers.

You'd probably also want to modify the ABI to pass a return address in a register (like syscall does on its own) instead of saving it on the user-space stack the way Linux's current sysenter ABI does; the kernel has to get_user() the EIP value it should return to.


Or, if all this overhead is part of what you want to measure, you're definitely all set with an eax that gives you -ENOSYS; at worst you'll get one extra branch miss on the range-check branch if the predictors are trained on normal 32-bit system calls.
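
If you do want to poke that path from user space without modifying anything, here is a sketch (not from the original answer; the bogus number is arbitrary): a 32-bit build can fetch __kernel_vsyscall from the auxiliary vector and call it with an out-of-range eax. On Intel CPUs that goes through sysenter and comes straight back with -ENOSYS.

    /* Sketch (build 32-bit: gcc -m32): call the 32-bit vDSO's
     * __kernel_vsyscall, which uses sysenter on Intel CPUs, with an
     * out-of-range number so do_syscall_32_irqs_on skips the dispatch
     * and just returns -ENOSYS. 0x7ffffff is arbitrary, well above
     * IA32_NR_syscalls. */
    #include <elf.h>                     /* AT_SYSINFO */
    #include <stdio.h>
    #include <sys/auxv.h>                /* getauxval() */

    int main(void)
    {
        void *vsys = (void *)getauxval(AT_SYSINFO);   /* __kernel_vsyscall */
        if (!vsys) {
            puts("no AT_SYSINFO (not running as a 32-bit process?)");
            return 1;
        }
        long ret;
        asm volatile ("call *%[vsys]"
                      : "=a" (ret)
                      : [vsys] "r" (vsys), "a" (0x7ffffff)
                      : "memory");
        printf("returned %ld\n", ret);   /* expect -38 (-ENOSYS) */
        return 0;
    }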

Nascent answered 21/2, 2018 at 19:44 Comment(12)
You'd think using an invalid syscall number would be faster, wouldn't you? It's about 17 cycles slower than say syscall(SYS_getpid) on my system, however: perhaps because the syscall() wrapper in glibc has to do extra work on an error return (e.g., setting errno)? Branch prediction isn't a concern here since I'm benchmarking this in a loop.Enjoy
@Enjoy Huh, 17 cycles is a lot, seems like more than glibc's error path should explain. You didn't say you were using glibc's generic syscall() wrapper, though. Maybe try going through the VDSO __vsyscall code for sysenter directly instead of via glibc, i.e. with the inline-asm macro from MUSL __syscall0(123456). Or if you're using 64-bit syscall, then it's easy to use it directly.Nascent
@BeeOnRope: you said sysenter in your question, not syscall; I was answering based on that. sysenter is hard to use directly, because you have to run it after saving an EIP return address on the user-space stack.Nascent
I got confused, I thought sysenter was still the way on Intel chips in 64-bit, but it seems like it is syscall. I will fix the question. I said sysret though, which is correct, so I was really mixing two things together!Enjoy
After further testing, it seems like it is the error handling code combined with the KPTI fix, slowing down ENOSYS. What seems to happen is that the test is very sensitive to the number of cache lines touched after the call returns since the TLB has been flushed due to KPTI (I'm on 4.13 so no PCID I think). The error handling probably touches a couple more lines to set errno and hence the worse performance. In particular, calling syscall from asm with no special error handling shows that the ENOSYS case is indeed the fastest (about tied with fast calls like getuid at around 40 cycles).Enjoy
With pti=off at boot, the asm loop (which doesn't touch memory) barely changes, but the other cases speed up and cluster much more closely, with 43, 47, and 51 cycles for getuid(), syscall(SYS_getuid) and syscall(123456) respectively, supporting the theory that KPTI was adding cost to the surrounding code (ENOSYS is still slower though). Calls with SYS_getuid or 123456 (non-existent) are pretty much exactly tied at 39 cycles.Enjoy
You should ignore the numbers above; I finally realized I was only timing user-mode cycles. The actual times are on the order of ~1800 cycles, so I need to double-check my work since that seems way too slow. Most of the cost is a wrmsr to register 0x48, which I think is a Spectre mitigation. I don't know how to turn it off.Enjoy
@BeeOnRope: you probably have to disable it at kernel config/build time to disable the wrmsr. It has to be in a fast path, so they probably don't add overhead checking for a run-time config. And maybe patching it in/out at boot is inconvenient, or not implemented yet. And yes, that is almost certainly the Spectre mitigation that makes future indirect branches not dependent on prediction info from earlier lower-privilege prediction data. Adding a virtual write-only MSR was something they could do purely in microcode, unlike a new instruction.Nascent
No, all of the mitigations have off-switches: you can switch them off with noibrs and noibpb at boot, just like KPTI. In fact you can do it dynamically after boot via the /sys/kernel filesystem, as long as your distro included that in the patch. With those disabled, the syscall cost goes down to a minimum of around ~160 cycles.Enjoy
Just a note that the /sys/kernel switches, described here, are apparently only available on RHEL-derived kernels, at least for now. You can still disable them on pretty much any mainline-derived kernel using the boot parameters, though that makes it more annoying to do back-to-back testing with and without the options.Enjoy
@BeeOnRope: did you ever publish your results anywhere? I'd like to have a canonical link for Linux system-call costs with/without Meltdown or Spectre mitigation, on CPUs like Skylake, Ryzen, or whatever. (And the kernel has multiple workaround strategies, too, e.g. avoiding wrmsr if the microcode is buggy.)Nascent
No, not really. I did put some numbers somewhere in some comments or an answer, but I wouldn't really consider them authoritative: they could be distro-specific, and I didn't dig into some weirdness I saw, e.g., much worse results when disabling mitigations on the boot command line (i.e., mitigation on was faster than mitigation off). The results showed a pretty big slowdown though, something like 100-ish cycles before any Meltdown stuff to 700-ish after.Enjoy

Some system calls don't even go through any user->kernel transition; read vdso(7).

I suspect that these VDSO system calls (e.g. time(2), ...) are the fastest. You could claim that they are not "real" system calls.

BTW, you could add a dummy system call to your kernel (e.g. some system call always returning 0, or a hello world system call, see also this) and measure it.

I suspect (without having benchmarked it) that getpid(2) should be a very fast system call, because the only thing it needs to do is fetch some data from kernel memory. And AFAIK, it is a genuine system call, not using VDSO techniques. And you could use syscall(2) to avoid the caching done by your libc and force the genuine system call.
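
For example (a sketch, not from the answer): going through the generic syscall(2) wrapper guarantees a real kernel entry even if the libc getpid() wrapper serves a cached value.

    /* Sketch: force a genuine getpid system call via syscall(2), bypassing
     * any caching the libc getpid() wrapper might do. */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t wrapped = getpid();             /* may be served from a libc cache */
        long  raw     = syscall(SYS_getpid);  /* always enters the kernel */
        printf("getpid()=%d  syscall(SYS_getpid)=%ld\n", (int)wrapped, raw);
        return 0;
    }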

I maintain my position (given in a comment on your initial question): without actual motivation your question does not make concrete sense. Still, I do think that syscall(2) doing getpid measures the typical overhead of making a system call (and I guess that's what you really care about). In practice, almost all system calls do more work than such a getpid (or getppid).

Yaron answered 21/2, 2018 at 18:49 Comment(1)
Indeed, but I am specifically excluding calls that don't enter the kernel (I'll make that requirement clearer in the question). The dividing line is that there must be a user/kernel transition (i.e., a sysenter call). About the dummy call, I want this to work on an unmodified kernel, and it's a heck of a lot of work compared to just using some existing fast call.Enjoy

In this benchmark by Brendan Gregg (linked from this blog post, which is interesting reading on the topic), close(999) (or some other fd not in use) is recommended.
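
A minimal sketch of what that looks like (not taken from the linked benchmark): closing a descriptor that was never opened fails with EBADF after little more than an fd-table lookup, so it's a cheap genuine system call to put in a timing loop.

    /* Sketch: close(999) on a descriptor that isn't open returns -1 with
     * errno == EBADF after little more than an fd-table lookup, which is
     * why it makes a cheap "real" system call to benchmark. */
    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        for (int i = 0; i < 1000000; i++)
            close(999);                  /* fd 999 assumed not to be open */
        printf("errno after last close: %d (EBADF = %d)\n", errno, EBADF);
        return 0;
    }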

Glaswegian answered 21/2, 2018 at 19:24 Comment(1)
Thanks for the links! close(999) seems to be about 20 cycles slower than getuid(): roughly 70 cycles versus 50 on my system with KPTI enabled.Enjoy
