Is fastcall really faster?

Asked 2/2, 2010 at 23:56 Answered 3/2, 2010 at 6:52

Solved c++ performance x86 calling-convention fastcall

Is the fastcall calling convention really faster than other calling conventions, such as cdecl? Are there any benchmarks out there that show how performance is affected by calling convention?

Agogue answered 2/2, 2010 at 23:56 Comment(4)

"How is performance affected by calling convention?" Marginally. – Myo 2/2, 2010 at 23:58

Except when it's affected massively. – Unseasonable 3/2, 2010 at 1:15

See also bcbjournal.org/articles/vol4/0004/… – Equilibrant 6/3, 2013 at 9:56

Some background may be found in this article: blogs.msdn.com/b/larryosterman/archive/2005/10/10/479278.aspx. To quote: "IIRC, back in the NT4 days, the entire NT kernel was recompiled with __fastcall and it got something like a 10% overall speedup." – Contemplation 15/10, 2013 at 11:32

It depends on the platform. For a Xenon PowerPC, for example, it can be an order of magnitude difference due to a load-hit-store issue with passing data on the stack. I empirically timed the overhead of a cdecl function at about 45 cycles compared to ~4 for a fastcall.

For an out-of-order x86 (Intel and AMD), the impact may be much less, because the registers are all shadowed and renamed anyway.

The answer really is that you need to benchmark it yourself on the particular platform you care about.
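As a starting point for such a measurement, here is a minimal sketch of a micro-benchmark. Everything in it is an assumption made for illustration and does not come from the answers: the `BENCH_*` macros, the `add_cdecl`/`add_fastcall` functions, and the `compare_conventions` helper are hypothetical names. The macros only select a distinct convention when compiling for 32-bit x86. On other targets both functions use the platform default, so the two timings should then come out roughly equal. A tight loop like this measures call overhead under ideal conditions (hot caches, perfectly predicted branches), so treat the numbers as a rough indication only.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical macros: pick a calling convention where the compiler has one.
// MSVC accepts and ignores __cdecl/__fastcall on x64; GCC/Clang only get the
// attributes on 32-bit x86, and fall back to the default convention elsewhere.
#if defined(_MSC_VER)
  #define BENCH_CDECL    __cdecl
  #define BENCH_FASTCALL __fastcall
  #define BENCH_NOINLINE __declspec(noinline)
#elif defined(__GNUC__) && defined(__i386__)
  #define BENCH_CDECL    __attribute__((cdecl))
  #define BENCH_FASTCALL __attribute__((fastcall))
  #define BENCH_NOINLINE __attribute__((noinline))
#elif defined(__GNUC__)
  #define BENCH_CDECL
  #define BENCH_FASTCALL
  #define BENCH_NOINLINE __attribute__((noinline))
#else
  #define BENCH_CDECL
  #define BENCH_FASTCALL
  #define BENCH_NOINLINE
#endif

// Two identical functions that differ only in calling convention.
// Inlining is disabled because an inlined call has no convention at all.
BENCH_NOINLINE std::uint32_t BENCH_CDECL add_cdecl(std::uint32_t a, std::uint32_t b) {
    return a + b;
}
BENCH_NOINLINE std::uint32_t BENCH_FASTCALL add_fastcall(std::uint32_t a, std::uint32_t b) {
    return a + b;
}

// Times `iterations` dependent calls and returns the average nanoseconds per call.
template <typename Call>
double ns_per_call(Call call, std::uint32_t iterations, std::uint32_t& checksum) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    std::uint32_t acc = 0;
    for (std::uint32_t i = 0; i < iterations; ++i) {
        acc = call(acc, i);  // each result feeds the next call, so the work stays live
    }
    const auto stop = std::chrono::steady_clock::now();
    checksum = acc;
    return std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
}

struct BenchResult {
    double cdecl_ns;
    double fastcall_ns;
    std::uint32_t checksum;
};

// Runs both variants; the lambdas make direct calls with the right convention.
inline BenchResult compare_conventions(std::uint32_t iterations) {
    std::uint32_t sum_cdecl = 0;
    std::uint32_t sum_fastcall = 0;
    BenchResult r{};
    r.cdecl_ns = ns_per_call(
        [](std::uint32_t a, std::uint32_t b) { return add_cdecl(a, b); },
        iterations, sum_cdecl);
    r.fastcall_ns = ns_per_call(
        [](std::uint32_t a, std::uint32_t b) { return add_fastcall(a, b); },
        iterations, sum_fastcall);
    assert(sum_cdecl == sum_fastcall);  // both variants must compute the same sum
    r.checksum = sum_cdecl;
    return r;
}
```

Calling `compare_conventions` with a large iteration count from `main`, printing `cdecl_ns` and `fastcall_ns`, and repeating the run a few times gives a first impression. Real code, with argument setup mixed into other work, may behave differently.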

Unseasonable answered 3/2, 2010 at 0:2 Comment(2)

More importantly, x86 CPUs are highly optimized for reloading recent stores, because this is very common in real code (especially across function boundaries, with pass-by-reference as well as stack args). Store-to-load forwarding makes the round trip only cost about 5 cycles of extra latency, with throughput limits only being the usual 2 or 3 loads per clock cycle. PowerPCs with huge penalties for reloading a recent store (instead of doing store-forwarding) are the exception, not the rule. I think most non-x86 CPUs, like modern ARM, also have store-forwarding. – Probably 1/9, 2022 at 16:59

What does "store-buffer forwarding" mean in the Intel developer's manual? / easyperf.net/blog/2018/03/09/… / and [Store-to-Load Forwarding and Memory Disambiguation in x86 Processors](blog.stuffedcow.net/2014/01/x86-memory-disambiguation) - it's memory disambiguation, not register renaming, that makes it efficient. Out-of-order exec can hide those 5 cycle latencies better than an in-order CPU could, though. – Probably 1/9, 2022 at 17:2

Is the fastcall calling convention really faster than other calling conventions, such as cdecl?

I believe that Microsoft's implementation of fastcall on x86 and x64 involves passing the first two parameters in registers instead of on the stack.

Since it typically saves at least four memory accesses, yes, it is generally faster. However, if the function involved is register-starved and is thus likely to spill those arguments to locals on the stack anyway, there's not likely to be a significant speedup.
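To make the register-versus-stack difference concrete, here is a small sketch for 32-bit x86. The `DEMO_*` macros and the `combine_*` functions are hypothetical names chosen for illustration, and the macros expand to nothing on targets where the compiler offers no separate fastcall convention. The comments describe the documented 32-bit behaviour of Microsoft's `__fastcall` and GCC's `fastcall` attribute.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical macros: select the conventions only where they are meaningful.
#if defined(_MSC_VER)
  #define DEMO_CDECL    __cdecl
  #define DEMO_FASTCALL __fastcall
#elif defined(__GNUC__) && defined(__i386__)
  #define DEMO_CDECL    __attribute__((cdecl))
  #define DEMO_FASTCALL __attribute__((fastcall))
#else
  #define DEMO_CDECL
  #define DEMO_FASTCALL
#endif

// cdecl on 32-bit x86: a, b and c are all pushed on the stack (right to left),
// and the caller removes them after the call returns.
std::int32_t DEMO_CDECL combine_cdecl(std::int32_t a, std::int32_t b, std::int32_t c) {
    return a * 100 + b * 10 + c;
}

// fastcall on 32-bit x86: a arrives in ECX, b in EDX, and only c goes on the
// stack; the callee removes the stack argument before returning.
std::int32_t DEMO_FASTCALL combine_fastcall(std::int32_t a, std::int32_t b, std::int32_t c) {
    return a * 100 + b * 10 + c;
}

// The results are identical; only the argument-passing mechanics differ.
inline void check_conventions_agree() {
    assert(combine_cdecl(1, 2, 3) == combine_fastcall(1, 2, 3));
}
```

Compiling this for 32-bit x86 and comparing the generated assembly for a call to each function shows the two register arguments replacing two stack pushes.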

Sentimentality answered 3/2, 2010 at 0:1 Comment(4)

In x64 there is only one calling convention – Fireproof 23/8, 2013 at 9:10

@Fireproof How exactly is there one calling convention? On Windows x86_64 mingw-w64 C++11, __attribute__((fastcall)) compiles and produces a fastcall-compatible function. Besides, an architecture cannot standardize calling conventions since they are a compiler feature. – Wyattwyche 21/4, 2019 at 13:1

@VladislavToncharov of course I'm specifically mentioning the calling convention on 64-bit Windows, since this question is talking about "Microsoft's implementation". Calling convention is defined by the platform, not the compiler. GCC on Windows still has to follow Windows' convention when interacting with outside components – Fireproof 21/4, 2019 at 15:50

The Windows x64 calling convention passes 4 args in registers. And yes they call it fastcall, at least now to distinguish from vectorcall which is almost the same. See Why does Windows64 use a different calling convention from all other OSes on x86-64? – Probably 1/9, 2022 at 17:26

Calling convention (at least on x86) doesn't really make much of a difference in speed. In Windows, _stdcall was made the default because it produces tangible results for nontrivial programs in that it usually results in smaller code size when compared with _cdecl. _fastcall is not the default value because the difference it makes is far less tangible. What you make up for in argument passing via registers you lose in less efficient function bodies (as previously mentioned by Anon.). You don't gain anything by passing in registers if the called function immediately needs to spill everything out into memory for its own calculations.

However, we can spout theoretical ideas all day long -- benchmark your code for the right answer. _fastcall will be faster in some cases, and slower in others.

Rupertruperta answered 3/2, 2010 at 5:32 Comment(0)

On modern x86 - no. Between L1 cache and inlining there's no place for fastcall.

Snippet answered 3/2, 2010 at 6:52 Comment(5)

If a function is inlined it is neither fastcall nor cdecl nor any other calling convention. – Unseasonable 3/2, 2010 at 7:15

Exactly. Fetching from L1 is 1 cycle over register - in most cases it's below noise level, it's hard to even benchmark it reliably. And functions where a few cycles per call make an important difference should be inlined anyway. – Snippet 3/2, 2010 at 7:45

I have to agree with this - any function that is simple enough to benefit from fastcall would benefit from inlining even more. – Oao 26/10, 2012 at 16:8

Except that inlining isn't always possible. Think callbacks from code implemented by two different parties ... – Marcin 22/10, 2018 at 13:59

Saving instructions by passing args in registers makes code slightly smaller and faster. It's a small benefit that adds up over a whole program. That's why all x86-64 calling conventions use some register args, like basically all non-x86. It may not be worth spending extra effort to manually enable it for 32-bit code, but I wouldn't say there's "no place for it". It's less important than making sure cross-file inlining is enabled (link-time optimization), especially for projects with lots of small functions defined in .cpp files instead of .h, but it's still useful. – Probably 1/9, 2022 at 17:30
