In my opinion, the overhead of calling a C function should be as low as setting the registers rcx, rdx, rsi, rdi, doing some fastcall, and getting the value out of rax. But I've heard of big overhead in cgo <…>
Your opinion is unfounded.
Calling from Go to C has noticeable overhead for the following reasons.
Let's first consider C
While not in any way required by the language, a typical C program compiled by a typical compiler and running on a typical OS as a regular process, heavily relies on the OS to carry out certain aspects of its runtime environment.
Arguably the most visible and important aspect is the stack: the kernel is responsible for setting it up after loading and initializing the program's image and before transferring execution to the entry point of the code of the newborn process.
Another crucial point is that, again, while not strictly required, most C programs rely on OS-native threads to implement multiple concurrently executing flows through the program's code.
The function calls performed in the C code are typically compiled using the ABI defined for the target combination of OS and hardware (unless, of course, the programmer has explicitly told the compiler to do otherwise, say, by marking a specific function as having a different calling convention).
C has no automatic means of managing non-stack memory ("the heap").
Such management is typically done via the C standard library functions of the malloc(3) family.
These functions manage the heap and consider any memory allocated through them as "theirs" (which is quite logical).
C does not provide automatic garbage collection.
Let's recap: a typical program compiled from C uses OS-supplied threads and the OS-supplied stacks in those threads; its function calls mostly follow the platform's ABI; heap memory is managed by dedicated library code; there is no GC.
Let's now consider Go
- Any bit of Go code (both that of your program and of the runtime) runs in so-called goroutines which are like super light-weight threads.
- The goroutine scheduler provided by the Go runtime (which is compiled/linked into any program written in Go) implements the so-called M×N scheduling of the goroutines—where M goroutines are multiplexed onto N OS-supplied threads, where M is typically way higher than N.
- Function calls in Go do not follow the target platform's ABI.
Specifically, AFAIK contemporary versions of Go pass all call arguments on the stack¹.
- A running goroutine always executes on an OS-provided thread.
A goroutine which is waiting on some resource managed by the Go runtime (such as an operation on a channel, a timer, a network socket etc) does not occupy an OS thread.
When the scheduler selects a goroutine for execution, it has to assign it to a free OS thread which is in the possession of the Go runtime;
while the scheduler tries hard to place the goroutine onto the same thread it was executing on before being suspended, that does not always succeed, and so goroutines may freely migrate between different OS threads.
The points above naturally lead to goroutines having their own stacks which are completely independent of those provided by the OS for its threads.
The heap memory is managed by the Go runtime, automatically, and it's done directly: the C standard library is not used for this.
Go has GC, and this GC is concurrent in that it runs completely concurrently with the goroutines executing the program's code.
The stacks used by goroutines are allocated on the heap using the memory manager provided by the Go runtime.
Unlike the thread stacks a C program runs on, these stacks can be reallocated².
Let's recap: goroutines have their own stacks, use a calling convention compatible with neither the platform's ABI nor that of C, and may execute on different OS threads at different points of their execution.
The Go runtime manages the heap memory directly (this includes the stacks of the goroutines) and has a fully-concurrent GC.
Let's now consider calls from Go to C
As you can probably see by now, the "worlds" of runtime environments in which Go and C code run are different enough to have a big "impedance mismatch", which requires certain gatewaying when doing FFI, and that gatewaying has a non-zero cost.
In particular, when the Go code is about to call into C, the following must be done:
- The goroutine must be locked to the OS thread it's currently running on ("pinned").
- Since the target C call must be done according to the platform's ABI, the current execution context must be saved—at least those registers which will be trashed by the call.
- The cgo machinery must verify that any memory about to be passed to the target C call does not contain pointers to other memory blocks managed by Go, recursively; this is to allow Go's GC to continue working concurrently.
- The execution must be switched from the goroutine stack to the thread's stack: a new stack frame must be created on the latter, and the parameters to the target C call must be placed there (and in the registers) according to the platform's ABI.
- The call is made.
- Upon return, the execution must be switched back to the goroutine's stack—again by gatewaying any returned results back to the stack frame of the executing goroutine.
As you can probably see, there are unavoidable costs, and placing values in some CPU registers is the most negligible of them.
What can be done about that
Generally, there are two vectors to attack the problem:
Make the calls to C infrequent.
That is, if each call to C carries out lengthy CPU-intensive calculations, the overhead of making the call is likely to be dwarfed by the speedup of the computation it performs.
Write critical functions in assembly.
Go allows writing code directly in the assembly of the target H/W platform.
One "trick" which may allow you to get the best of both worlds is to use the ability of most industrial compilers to output the assembly-language form of the functions they compile. So you may employ hard-core facilities provided by a C compiler, such as auto-vectorisation (for SSE) and aggressive optimisation, and then grab whatever it generated and wrap it in a thin layer of assembly which adapts the generated code to Go's native ABI.
There's a host of 3rd-party Go packages which do this (say, this and that) and obviously the Go runtime does this as well.
¹ Since 1.17, Go has been progressively switching to a register-based calling convention.
I have no information on whether this makes Go code compiled for particular GOOS/GOARCH combos follow their native ABIs or not.
Go 1.18 implements the register-based calling convention on all supported OSes when compiled for 64-bit CPUs (or CPU modes).
² Before 1.4, goroutine stacks had an even more interesting design: they could consist of multiple segments forming a linked list; when a stack needed to grow beyond its current size, a new segment was allocated and linked to the last one. This was called "split stacks".