Why is uint_least16_t faster than uint_fast16_t for multiplication in x86_64?

The C standard is quite unclear about the uint_fast*_t family of types. On a gcc-4.4.4 linux x86_64 system, the types uint_fast16_t and uint_fast32_t are both 8 bytes in size. However, multiplication of 8-byte numbers seems to be considerably slower than multiplication of 4-byte numbers. The following piece of code demonstrates that:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main ()
{
  uint_least16_t p, x;
  int count;

  p = 1;
  for (count = 100000; count != 0; --count)
    for (x = 1; x != 50000; ++x)
      p*= x;

  printf("%"PRIuLEAST16, p);
  return 0;
}

Running the time command on the program, I get

real 0m7.606s
user 0m7.557s
sys  0m0.019s

If I change the type to uint_fast16_t (and the printf modifier), the timing becomes

real 0m12.609s
user 0m12.593s
sys  0m0.009s
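For reference, the uint_fast16_t version is identical apart from the type and the format macro:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main ()
{
  uint_fast16_t p, x;
  int count;

  p = 1;
  for (count = 100000; count != 0; --count)
    for (x = 1; x != 50000; ++x)
      p *= x;

  printf("%"PRIuFAST16, p);
  return 0;
}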

So, would it not be much better if the stdint.h header defined uint_fast16_t (and also uint_fast32_t) to be a 4-byte type?

Banns answered 7/11, 2010 at 2:48 Comment(1)
Just a tiny tip: You don't need to include both stdint.h and inttypes.h; according to the ISO C standard, inttypes.h must always include stdint.h, so including stdint.h separately is just a waste of time (not that it is forbidden or incorrect, just unnecessary).Reiter

AFAIK compilers only define their own versions of the (u)int_(fast/least)XX_t types if these are not already defined by the system. That is because it is very important that these types are defined identically across all libraries/binaries on a single system. Otherwise, if different compilers defined those types differently, a library built with CompilerA could have a different uint_fast32_t type than a binary built with CompilerB, yet this binary might still link against the library; there is no formal standard requirement that all executable code of a system has to be built by the same compiler (actually, on some systems, e.g. Windows, it is rather common for code to have been compiled by all kinds of different compilers). If this binary then calls a function of the library, things will break!

So the question is: Is it really GCC defining uint_fast16_t here, or is it actually Linux (I mean the kernel) or maybe even the Standard C Lib (glibc in most cases) that defines those types? If Linux or glibc defines these, a GCC built for that system has no choice but to adopt whatever conventions they have established. The same is true for all the other variable-width types, too: char, short, int, long, long long; all these types have only a minimum guaranteed bit width in the C Standard (and for int it is actually 16 bit, so on platforms where int is 32 bit, it is already much bigger than the standard would require).
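(For reference, on glibc-based systems it is glibc's <stdint.h> that makes this choice, keyed off the word size rather than any GCC option. Simplified and paraphrased, not the verbatim header, the unsigned fast typedefs look roughly like this:)

/* Paraphrased sketch of glibc's fast typedefs in <stdint.h> (not verbatim): */
typedef unsigned char           uint_fast8_t;
#if __WORDSIZE == 64
typedef unsigned long int       uint_fast16_t;  /* 8 bytes on x86_64 */
typedef unsigned long int       uint_fast32_t;
typedef unsigned long int       uint_fast64_t;
#else
typedef unsigned int            uint_fast16_t;  /* 4 bytes on 32-bit targets */
typedef unsigned int            uint_fast32_t;
typedef unsigned long long int  uint_fast64_t;
#endif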


Other than that, I actually wonder what is wrong with your CPU/compiler/system. On my system, 64 bit multiplication is just as fast as 32 bit multiplication. I modified your code to test 16, 32, and 64 bit:

#include <time.h>
#include <stdio.h>
#include <inttypes.h>

#define RUNS 100000

#define TEST(type)                                  \
    static type test ## type ()                     \
    {                                               \
        int count;                                  \
        type p, x;                                  \
                                                    \
        p = 1;                                      \
        for (count = RUNS; count != 0; count--) {   \
            for (x = 1; x != 50000; x++) {          \
                p *= x;                             \
            }                                       \
        }                                           \
        return p;                                   \
    }

TEST(uint16_t)
TEST(uint32_t)
TEST(uint64_t)

#define CLOCK_TO_SEC(t) ((double)(t) / CLOCKS_PER_SEC)

#define RUN_TEST(type)                             \
    {                                              \
        clock_t clockTime;                         \
        unsigned long long result;                 \
                                                   \
        clockTime = clock();                       \
        result = test ## type ();                  \
        clockTime = clock() - clockTime;           \
        printf("Test %s took %2.4f s. (%llu)\n",   \
            #type, CLOCK_TO_SEC(clockTime), result \
        );                                         \
    }

int main ()
{
    RUN_TEST(uint16_t)
    RUN_TEST(uint32_t)
    RUN_TEST(uint64_t)
    return 0;
}

Using unoptimized code (-O0), I get:

Test uint16_t took 13.6286 s. (0)
Test uint32_t took 12.5881 s. (0)
Test uint64_t took 12.6006 s. (0)

Using optimized code (-O3), I get:

Test uint16_t took 13.6385 s. (0)
Test uint32_t took 4.5455 s. (0)
Test uint64_t took 4.5382 s. (0)

The second output is quite interesting. @R.. wrote in a comment above:

On x86_64, 32-bit arithmetic should never be slower than 64-bit arithmetic, period.

The second output shows that the same cannot be said for 16 vs. 32 bit arithmetic. 16 bit arithmetic can be significantly slower on a 32/64 bit CPU, even though my x86 CPU can natively perform 16 bit arithmetic (unlike some other CPUs, such as a PPC, which can only perform 32 bit arithmetic). However, on my CPU this only seems to apply to multiplication; when I change the code to do addition/subtraction/division, there is no longer a significant difference between 16 and 32 bit.

The results above are from an Intel Core i7 (2.66 GHz), but if anyone is interested, I can also run this benchmark on an Intel Core 2 Duo (one CPU generation older) and on a Motorola PowerPC G4.

Reiter answered 28/8, 2012 at 11:39 Comment(2)
32 and 16 bit operand-sizes are very different on x86-64. 32-bit writes zero-extend to avoid false dependencies, but 16-bit writes date back to 386 and merge into the old value of the destination register. And on Nehalem and earlier (first-gen i7), that could create partial-register merging stalls. Why doesn't GCC use partial registers? (it doesn't always avoid them). Your loop is already dependent on multiply latency, not the 1/clock throughput, with a rep count too high for OoO exec to overlap much, so a plain false dependency doesn't explain it.Mountebank
The document that defines the sizes of uint_fast16_t and so on would normally be the ABI. But for some reason, the x86-64 System V ABI doesn't mention them, so it's just up to glibc to define them. 8-byte was a poor choice for x86-64 but we're now stuck with it (because their sizes are a de-facto part of the ABI). So int_fast32_t is a terrible idea for most arrays and many other uses on a major set of C/C++ implementations, basically ruining those types for most use-cases. It does save an instruction (to sign extend to 64) when passing a function arg in a reg to be used as an index.Mountebank

I think that such a design decision is not simple to make. It depends on many factors, and for the moment I don't take your experiment as conclusive; see below.

First of all, there is no such thing as a single concept of what fast should mean. Here you emphasize in-place multiplication, which is just one particular point of view.

Then, x86_64 is an architecture and not a processor, so outcomes might be quite different for different processors in that family. I don't think it would be sane for gcc to have the type decision depend on particular command-line switches that optimize for a given processor.

Now to come back to your example: I guess you have also looked at the assembler code? Did it, e.g., use SSE instructions to realize your code? Did you switch on processor-specific options, something like -march=native?

Edit: I experimented a bit with your test program, and if I leave it exactly as it is I can basically reproduce your measurements. But after modifying it and playing around with it, I am even less convinced that it is conclusive.

E.g., if I change the inner loop to also go downward, the assembler looks almost the same as before (but using a decrement and a test against 0), yet the execution takes about 50% longer. So I guess the timing depends very much on the environment of the instruction that you want to benchmark: pipeline stalls, whatever. You'd have to bench code of very different natures, where the instructions are issued in different contexts and alignment problems and vectorization come into play, to decide what the appropriate types for the fast typedefs are.
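For concreteness, the downward-counting variant I mean looks roughly like this (a sketch, not necessarily the exact code I timed; the factors are the same as before, just in reverse order, so the product is unchanged):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main ()
{
  uint_least16_t p, x;
  int count;

  p = 1;
  for (count = 100000; count != 0; --count)
    for (x = 49999; x != 0; --x)   /* inner loop now counts down */
      p *= x;

  printf("%"PRIuLEAST16"\n", p);
  return 0;
}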

Mete answered 7/11, 2010 at 7:44 Comment(7)
"I don't think that it would be sane that gcc would have the type decision depend on particular commandline switches that optimize for a given processor." As R. says, that's an ABI question. If the compiler says that the ABI is independent of those switches, and the ABI specifies the sizes of those typedefs, then of course it can't vary. Whether it's sane to use the fast types in a library interface is also questionable, I think ;-)Gazzo
@Steve, I completely agree. My point was just that for a given ABI that applies eventually to a whole family of processors, you can't deduce much for the validity of a particular choice by doing just one test, with one type of instruction, on one particular processor and with one particular compiler.Mete
@Jens: yes, I didn't intend to disagree, just add that varying those types by compiler options isn't just "insane" or user-unfriendly in a vague sense, there's a specific issue with call compatibility.Gazzo
@Jens: On x86_64, 32-bit arithmetic should never be slower than 64-bit arithmetic, period. Even if they're the same speed, the smallest of the 2 types that can hold the value should be chosen. [u]int_fast*_t are just wrong.Icky
@R.: If you already happen to have the beast in a register and your operation is done by the "classical" cpu, yes. If you have to load it from some 32bit alignment and the efficient version of the operation that you want to perform is on SSE, maybe not.Mete
@Jens Gustedt: I compiled the program with "-march=native -O2 -pipe -Wall -Wl,-O1 -Wl,--as-needed", basically gentoo's recommended set of options. I had not looked at the assembly, but now I did and I was not surprised. The "least" version emits 32bit multiplications and the "fast" version emits 64bit multiplications. It is interesting that, inheriting from a CISC architecture, x86_64 has arithmetic instructions for 8, 16, 32 and 64 bits.Incommunicable
I was wrong, vorbis does not use the int_fast*_t types. ffmpeg's libavcodec does, in its vorbis_dec.c and vorbis.c files.Incommunicable

Actual runtime performance is a very, very complicated topic, with many factors ranging from RAM, hard disks and OSes to the many processor-specific quirks. But this will give you a rough rundown:

N_fastX_t

  • The optimal size for the processor to calculate most (addition and subtraction) operations efficiently. This is hardware specific; for example, 32bit variables may be native and faster (and hence used) compared to 16bit variables.
  • As it does not benefit from cache-line hits as well as N_leastX_t does, this should be used mainly when a single variable is needed as fast as possible, and not when it sits in a large array (how large the array has to be before the trade-off flips is sadly platform dependent).
  • Note that fast vs least has several quirky cases, mainly multiplication and division, and these are platform specific. However, if most operations are addition / subtraction / OR / AND, it is generally safe to assume fast is faster (once again, mind the CPU cache and other quirks).

N_leastX_t

  • The smallest variable the hardware will allow that is at least of size X. (Some platforms are unable to assign variables below 4 bytes, for example. In fact, most of your BOOL variables take up at least a byte, not a bit.)
  • May be calculated via costly software emulation if hardware support does not exist.
  • May have a performance penalty, compared to fast, on a per-operation basis due to only partial hardware support.
  • HOWEVER: as it takes up less space per variable, it hits the cache lines a lot more frequently. This is much more prominent in arrays, and in that case it will be faster (memory cost > CPU cost); see the small sketch after this list and http://en.wikipedia.org/wiki/CPU_cache for more details.
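A tiny sketch of that array-footprint point (the sizes printed assume a typical x86_64 glibc system; they are implementation-defined):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* On a 64-bit glibc system uint_fast16_t is typically 8 bytes and
       uint_least16_t 2 bytes, so the "least" array is a quarter of the
       size and packs four times as many elements per cache line. */
    uint_fast16_t  fast_array[1000];
    uint_least16_t least_array[1000];

    printf("uint_fast16_t  array: %zu bytes\n", sizeof fast_array);
    printf("uint_least16_t array: %zu bytes\n", sizeof least_array);
    return 0;
}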

The Multiplication problem?

Also, to answer why the larger fastX variable would be slower in multiplication: it is due to the very nature of multiplication (which is similar to what you were taught in school).

http://en.wikipedia.org/wiki/Binary_multiplier

//Assuming 4bit int
   0011 (3 in decimal)
 x 0101 (5 in decimal)
 ======
   0011 ("0011 x 0001")
  0000- ("0011 x 0000")
 0011-- ("0011 x 0001")
0000--- ("0011 x 0000")
=======
   1111 (15 in decimal)

However, it is important to know that a computer is a "logical idiot". While it is obvious to us humans to skip the all-zero rows, the computer will still work them out (it is cheaper than checking conditionally and then working them out anyway). Hence this creates a quirk for a larger-sized variable holding the same value:

   //Assuming 8bit int
      0000 0011 (3 in decimal)
    x 0000 0101 (5 in decimal)
    ===========
      0000 0011 ("0011 x 0001")
    0 0000 000- ("0011 x 0000")
   00 0000 11-- ("0011 x 0001")
  000 0000 0--- ("0011 x 0000")
 0000 0000 ---- (And the remainders of zeros)
 -------------- (Will all be worked out)
 ==============
      0000 1111 (15 in decimal)

While I did not write out the remaining 0x0 additions in the multiplication process, it is important to note that the computer will "get them done". Hence it is natural that multiplying a larger variable takes longer than multiplying its smaller counterpart. (That is why it is always good to avoid multiplications and divisions whenever possible.)
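For illustration, the schoolbook shift-and-add procedure drawn above looks like this in C (a toy sketch only; a real CPU uses a hardware multiplier, not a software loop like this):

#include <stdio.h>
#include <stdint.h>

/* Naive shift-and-add multiply: one partial product per bit of b.
   The all-zero rows are added too, just like the "logical idiot" above. */
static uint16_t shift_add_mul(uint16_t a, uint16_t b)
{
    uint16_t product = 0;
    for (unsigned bit = 0; bit < 16; ++bit) {
        uint16_t row = (uint16_t)(((b >> bit) & 1u) * ((unsigned)a << bit));
        product = (uint16_t)(product + row);
    }
    return product;
}

int main(void)
{
    printf("%u\n", (unsigned)shift_add_mul(3, 5)); /* prints 15 */
    return 0;
}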

However, here comes the 2nd quirk, and it may not apply to all processors. It is important to note that all CPU operations are counted in CPU cycles, and in each cycle dozens (or more) of such small addition operations are performed, as seen above. As a result, an 8bit addition may take the same amount of time as an 8bit multiplication, and so on, due to the various optimizations and CPU-specific quirks.

If it concerns you that much, go refer to Intel: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html


Additional mention about CPU vs RAM

CPUs have advanced along Moore's law to be several times faster than your DDR3 RAM.

This can result in situations where more time is spent looking the variable up from RAM than it takes the CPU to "compute" it. This is most prominent in long pointer chains.

So while a CPU cache exists on most processors to reduce "RAM look-up" time, its use is limited to specific cases (where cache lines benefit the most). For cases when the data does not fit, note that RAM look-up time > CPU processing time (excluding multiplication/division/some quirks).

Thibodeau answered 15/5, 2013 at 6:37 Comment(0)

Yes, I think this is simply a mistake. Unfortunately you can't just go fixing mistakes like this without breaking the ABI, but it may not matter since virtually nobody (and certainly no library functions I know of) actually uses the *int_fast*_t types.

Icky answered 7/11, 2010 at 3:3 Comment(11)
If I am not wrong, the vorbis library does. Also, every time gcc upgrades from gcc-x.y.z to gcc-x.y+1.w the ABI may break, so we could make that change at that time.Incommunicable
Does vorbis use these types in the public interface, or just internally? I would hope just the latter. Regardless of gcc policy, there's also an issue of library maintainers' policy. If gcc broke their ABIs, they would be pretty damn pissed, and probably institute workarounds to lock in the old type rather than changing with gcc.Icky
@R. Just out of interest, does the linux x86_64 ABI include the fast types (or for that matter anything from stdint)?. It's a slight worry to me that ABI documents often ignore them and hence (one might argue) it is not safe to pass them as parameters unless you static link.Gazzo
@Steve: I think we're working with slightly different meanings of the word ABI. Your question seems to be about published documentation, which I'm not sure about, while my idea of ABI is "any implementation details which affect the ability to link object files produced by different tools (or versions of the same tool) for the same machine architecture, OS, and API". Of course depending on your view of what constitutes API, one might say that this is really already an issue at the API level before you even get to ABI, but I usually consider implementation of types to be an ABI issue.Icky
@Steve Jessop: Linux uses the SYS V x86-64 ABI, and it does not specify those types.Downpour
@caf: so if it turns out that the calling convention used for int_fast32_t is undocumented on linux (because nothing documents the underlying type), would you agree that it can't be used across a dynamic linked interface with any guarantee of safety? Since different compilers for linux, or different compiler options, surely are permitted to use different underlying types if it's nowhere specified? Or am I missing something? The only docs I can find for stdint.h on linux say that int32_t is long, so I certainly don't trust those for 64 bit systems...Gazzo
@R. Yes, slightly different I guess. You mean the ABI that the implementation uses, and I mean the ABI that the implementation/OS guarantees, maybe? Dynamic linking relies on the ABIs matching, whereas knowing that dynamic linking is going to work relies on them being guaranteed to match, across different compilers/versions claiming to use "the same ABI". But for example, if linux somewhere guarantees (perhaps unwisely) that on a 64 bit system, int_fast32_t is long, then we're sorted, regardless of whether the ABI document itself actually mentions fast types, so I was perhaps over-specific.Gazzo
@Steve Jessop: Even static linking is a problem, if the two objects were compiled with different definitions for those types. I would stick to the base types for external interfaces (though you can always send in a suggestion to the authors of the ABI document, that they add the C99 extended types).Downpour
@caf: true, fair point. I tend to assume (perhaps wrongly) that static linking will be for objects compiled with the same compiler, and it's less likely those types will vary with compiler options in the same compiler. Whereas if it's undocumented for the OS calling convention, then different compilers using different definitions of the fast types is a clear and present danger, especially since gcc seems to have made a non-obvious choice. I shouldn't have said "static linking" in the first place, I should have said "compile everything yourself" or something...Gazzo
@caf: I might as well at least ask, I suppose. If someone puts their email address in a public document, is it automatically fair play for members of the public to email them about that document? I'm guessing, though, that they'd consider this to be in the same league as the contents of struct tm - Not Our Problem, Ask The Compiler Vendor.Gazzo
@Steve Jessop: It would certainly seem reasonable to me.Downpour

Just because I was curious about the fast integer types I benchmarked a real-life parser which, in its semantic part, used an integer type for indexing arrays and C++-containers. It performed a mix of operations rather than a simple loop and most of the program did not depend on the integer type chosen. Actually, for my particular data, any integer type would have been fine. So all versions produced the same output.

At the assembly level there are 8 cases: four for the sizes and 2 for the signedness. The 24 ISO C type names must be mapped to the eight basic types. As Jens already stated, a "good" mapping must consider the particular processor and the particular code. Therefore, in practice, we should not expect perfect results even though the compiler writers should know the generated code.
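(For anyone who wants to check the mapping on their own system, printing the sizes is enough; a small sketch, separate from the benchmark itself, shown here for the unsigned names only, the signed ones behave the same:)

#include <stdio.h>
#include <stdint.h>

/* Print the size each <stdint.h> name maps to on this implementation. */
#define SHOW(t) printf("%-16s : %zu bytes\n", #t, sizeof(t))

int main(void)
{
    SHOW(uint8_t);  SHOW(uint_least8_t);  SHOW(uint_fast8_t);
    SHOW(uint16_t); SHOW(uint_least16_t); SHOW(uint_fast16_t);
    SHOW(uint32_t); SHOW(uint_least32_t); SHOW(uint_fast32_t);
    SHOW(uint64_t); SHOW(uint_least64_t); SHOW(uint_fast64_t);
    return 0;
}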

Many runs of the example were averaged, so that the fluctuation of the run time is only about 2 units in the last digit given. For this particular setup the results were:

  • No runtime difference between int16_t / uint16_t and int64_t / uint64_t respectively.
  • Unsigned versions are much faster for int8_t / uint8_t and int32_t / uint32_t respectively.
  • Unsigned versions are always smaller (text and data segment) than signed versions.

Compiler: g++ 4.9.1, Options: -O3 -mtune=generic -march=x86-64

CPU: Intel™ Core™ 2 Duo E8400 @ 3.00GHz

The mapping

| Sign | Size [bits] | Types                                                                 |
|:----:|------------:|:----------------------------------------------------------------------|
|  u   |           8 | uint8_t  uint_fast8_t   uint_least8_t                                 |
|  s   |           8 | int8_t   int_fast8_t    int_least8_t                                  |
|  u   |          16 | uint16_t uint_least16_t                                               |
|  s   |          16 | int16_t  int_least16_t                                                |
|  u   |          32 | uint32_t uint_least32_t                                               |
|  s   |          32 | int32_t  int_least32_t                                                |
|  u   |          64 | uint64_t uint_fast16_t  uint_fast32_t  uint_fast64_t  uint_least64_t  |
|  s   |          64 | int64_t  int_fast16_t   int_fast32_t   int_fast64_t   int_least64_t   |

Sizes and Timings

| Sign | Size [bits] | text [bytes] | data [bytes] | bss [bytes] | Time [ms] |
|:----:|------------:|-------------:|-------------:|------------:|----------:|
|  u   |           8 |      1285801 |         3024 |        5704 |    407.61 |
|  s   |           8 |      1285929 |         3032 |        5704 |    412.39 |
|  u   |          16 |      1285833 |         3024 |        5704 |    408.81 |
|  s   |          16 |      1286105 |         3040 |        5704 |    408.80 |
|  u   |          32 |      1285609 |         3024 |        5704 |    406.78 |
|  s   |          32 |      1285921 |         3032 |        5704 |    413.30 |
|  u   |          64 |      1285557 |         3032 |        5704 |    410.12 |
|  s   |          64 |      1285824 |         3048 |        5704 |    410.13 |
Disembodied answered 5/2, 2015 at 21:3 Comment(0)
