Fastest integer type for common architectures

The stdint.h header lacks an int_fastest_t and uint_fastest_t to correspond with the {,u}int_fastX_t types. For instances where the width of the integer type does not matter, how does one pick the integer type that allows processing the greatest quantity of bits with the least penalty to performance? For example, if one was searching for the first set bit in a buffer using a naive approach, a loop such as this might be considered:

// return the bit offset of the first 1 bit
size_t find_first_bit_set(void const *const buf)
{
    uint_fastest_t const *p = buf; // use the fastest type for comparison to zero
    for (; *p == 0; ++p); // inc p while no bits are set
    // return offset of first bit set (cast buf: arithmetic on void * is non-standard)
    return (p - (uint_fastest_t const *)buf) * sizeof(*p) * CHAR_BIT + ffsX(*p) - 1;
}

Naturally, using char would result in more operations than int. But long long operations might cost more on a 32 bit system than the loop overhead they save over int, and so on.

My current assumption is that for the mainstream architectures, long is the safest bet: it's 32 bit on 32 bit systems, and 64 bit on 64 bit systems.
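
For illustration, the loop pinned to unsigned long might look like this (a sketch only: it assumes a GCC/Clang toolchain for __builtin_ffsl, which stands in for the hypothetical ffsX, plus suitable alignment and at least one set bit in the buffer):

// sketch: the loop above specialized to unsigned long
size_t find_first_bit_set_ulong(void const *const buf)
{
    unsigned long const *p = buf;
    for (; *p == 0; ++p); // inc p while no bits are set
    // the cast of *p to long preserves the bit pattern on mainstream
    // two's-complement targets
    return (p - (unsigned long const *)buf) * sizeof(*p) * CHAR_BIT
           + __builtin_ffsl((long)*p) - 1;
}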

Cashbook asked 12/9, 2010 at 7:54 Comment(6)
Would the native int size be the fastest? IOW int.Martinic
Surely you need to assign p to point at buf there?Womenfolk
Actually, I don't see the point of "fast" types. For which operations are they fastest? For your application, 64-bit type might be the fastest, but for division the widest type is the slowest.Yokefellow
Thanks @Jon Cage, I was very distracted when I wrote this code.Cashbook
I understand, the code is simplified and I hope you take alignment into account. Dereferencing an unaligned pointer, if it doesn't cause some fatal error, can be pretty slow.Atomy
@leppie: But int is not the native int size on for example amd64 systems.Cashbook

int_fast8_t is always the fastest integer type in a correct implementation. There can never be integer types smaller than 8 bits (because CHAR_BIT>=8 is required), and since int_fast8_t is the fastest integer type with at least 8 bits, it's thus the fastest integer type, period.

Collinsworth answered 12/9, 2010 at 14:45 Comment(6)
Theoretically this answer is correct (very impressive reasoning), but on my system: typedef unsigned char uint_fast8_t;, which I'm sure is not going to be the fastest.Cashbook
On x86, all types except 16-bit (which is slow) are equally fast.Collinsworth
Actually, a slight correction: for a single variable, they're all equally fast, but if you're using a large number of integer variables (especially an array), smaller types will of course be faster because of reduced impact on the cache. Thus, in a sense, char is the fastest type on x86.Collinsworth
Well I won't say you're wrong without testing it first, but it sounds somewhat unexpected. Keep in mind that loop overhead is considered against any speedup that smaller types might give. Total throughput is being considered here.Cashbook
Oh, well it seems I missed what you're trying to do, but your question was also misleading in saying you don't care about the size. You do care about the size because it affects how many times you have to loop. In that case, I would recommend just using size_t or uintptr_t, which are almost surely the system word size (int might not be).Collinsworth
VC++2013 has typedef unsigned char uint_fast8_t; is that correct? Does char provide the same speed as int? I remember that replacing unsigned char with unsigned int in a CRC8 implementation gave a 4x speedup on both x86 and PowerPC platforms.Corvin

I'm not sure I really understand the question, but why aren't you just using int? Quoting from my (free draft copy of the wrong, i.e. C++) standard, "Plain ints have the natural size suggested by the architecture of the execution environment."

But I think that if you want to have the optimal integer type for a certain operation, it will be different depending on which operation it is. Trying to find the first bit in a large data buffer, or finding a number in a sequence of integers, or moving them around, could very well have completely different optimal types.

EDIT:

For whatever it's worth, I did a small benchmark. On my particular system (Intel i7 920 with Linux, gcc -O3) it turns out that long ints (64 bits) are quite a bit faster than plain ints (32 bits), on this particular example. I would have guessed the opposite.
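
A minimal benchmark along these lines illustrates the idea (a sketch, not the exact code I ran; it assumes POSIX clock_gettime and a buffer whose only set bit is in the last byte):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24) // bytes in the test buffer

static size_t scan_int(unsigned char const *buf)
{
    int const *p = (int const *)buf;
    while (*p == 0) ++p; // skip zero words, one int at a time
    return (size_t)((unsigned char const *)p - buf);
}

static size_t scan_long(unsigned char const *buf)
{
    long const *p = (long const *)buf;
    while (*p == 0) ++p; // skip zero words, one long at a time
    return (size_t)((unsigned char const *)p - buf);
}

static double seconds(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    unsigned char *buf = calloc(N, 1);
    struct timespec t0, t1, t2;
    if (!buf) return 1;
    buf[N - 1] = 1; // one set bit at the very end so the scans terminate

    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t a = scan_int(buf);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    size_t b = scan_long(buf);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("int:  offset %zu, %f s\n", a, seconds(t0, t1));
    printf("long: offset %zu, %f s\n", b, seconds(t1, t2));
    free(buf);
    return 0;
}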

Cropeared answered 12/9, 2010 at 8:25 Comment(4)
int is not 64 bit on the very common LP64 x86_64 platformCashbook
@Matt: Yes. but will it automatically be faster if it's wider? I don't think so, but I haven't benchmarked anything.Cropeared
Because there are only 5 different possible sizes for standard integer types (char, short, int, long, long long) and all the forces go towards having types of width 8, 16, 32, 64 and, in the near future, 128, int will be stuck on 32 bit. Its definition will have nothing to do with efficiency on the platform, but just be constrained by that. On my machine int is 32 bit and int_fast32_t is long, which in turn is 64 bit.Underwing
@Jens Gustedt: That throws out the possibility that int would be advanced when/if 32bit integers are no longer the fastest on x86, thanks.Cashbook

Theoretically, int is the best bet. It should map to the CPU's native register size, and thus be "optimal" in the sense you're asking about.

However, you may still find that an int-64 or int-128 is faster on some CPUs than an int-32, because although these are larger than the register size, they will reduce the number of iterations of your loop, and thus may work out more efficient by minimising the loop overheads and/or taking advantage of DMA to load/store the data faster.

(For example, on ARM-2 processors it took 4 memory cycles to load one 32-bit register, but only 5 cycles to load two sequentially, and 7 cycles to load 4 sequentially. The routine you suggest above would be optimised to use as many registers as you could free up (8 to 10 usually), and could therefore run up to 3 or 4 times faster by using multiple registers per loop iteration)
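
As a sketch of that idea (mine, not from the original answer; it assumes the buffer contains at least one set bit and that its length is a multiple of four words), an unrolled scan reads several words per iteration so the loop overhead is amortized:

#include <stddef.h>

// return the index of the first nonzero word
size_t first_nonzero_word(unsigned long const *p)
{
    unsigned long const *const start = p;
    for (;;) {
        // four loads per iteration; sequential loads like these are
        // what the multi-register ARM loads described above exploit
        unsigned long a = p[0], b = p[1], c = p[2], d = p[3];
        if (a | b | c | d)
            break;
        p += 4;
    }
    while (*p == 0) // narrow down to the exact word within the group
        ++p;
    return (size_t)(p - start);
}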

The only way to be sure is to write several routines and then profile them on the specific target machine to find out which produces the best performance.

Fantom answered 12/9, 2010 at 8:27 Comment(2)
Except for the fact that I don't think int maps to the native register size on x86_64 (is this a property of LP64 traditions only and not the case on say ARM?), you make a good point!Cashbook
@Matt: Yes, that's why I covered myself by starting with "theoretically" :-) The size of 'int' is ultimately compiler-defined, so this type of optimisation really has to be determined for every compiler on every platform.Fantom

If you want to be certain you've got the fastest implementation, why not benchmark each one on the systems you're expecting to run on instead of trying to guess?

Womenfolk answered 12/9, 2010 at 8:3 Comment(2)
Justify why, if this can be benchmarked on a platform-by-platform basis, it isn't in stdint.h with its bedfellows, the {,u}int_leastX_t types, which presumably have undergone the same checks.Cashbook
Well it depends on what operation / algorithm you're trying to implement. Look at Jason Williams' answer for example. Those types might get you pretty close to the fastest solution, but there are probably a few instances where there are faster ways to do things. Ultimately, the only way to be sure is to benchmark.Womenfolk

I would guess that the types size_t (for an unsigned type) and ptrdiff_t (for a signed type) will usually correspond to quite efficient integer types on any given platform.

But nothing can prove that other than inspecting the produced assembler and running benchmarks.

Edit, including the different comments, here and in other replies:

size_t and ptrdiff_t are the only typedefs that are normative in C99 and for which one may make a reasonable assumption that they are related to the architecture.

There are 5 different possible ranks for standard integer types (char, short, int, long, long long). All the forces go towards having types of width 8, 16, 32, 64 and, in the near future, 128. As a consequence int will be stuck on 32 bit. Its definition will have nothing to do with efficiency on the platform, but just be constrained by that width requirement.
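
A quick way to see what these typedefs resolve to on a given platform (an illustrative check of widths only; it proves nothing about speed):

#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    printf("int:          %zu bits\n", sizeof(int) * CHAR_BIT);
    printf("long:         %zu bits\n", sizeof(long) * CHAR_BIT);
    printf("size_t:       %zu bits\n", sizeof(size_t) * CHAR_BIT);
    printf("ptrdiff_t:    %zu bits\n", sizeof(ptrdiff_t) * CHAR_BIT);
    printf("int_fast32_t: %zu bits\n", sizeof(int_fast32_t) * CHAR_BIT);
    return 0;
}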

Underwing answered 12/9, 2010 at 8:15 Comment(2)
I suspect that size_t was picked to match the integer width required to address the maximum addressable virtual memory space. While it's likely the same as the native register width I don't think it was picked with performance in mind. Also a nitpick: ssize_t should be used for signed integers of this kind, ptrdiff_t should only be used for pointer subtraction :) (and {,u}intptr_t for general pointer manipulation :])Cashbook
@Matt: size_t and ptrdiff_t are the only types that are guaranteed to exist by C99, the others are optional.Underwing

The answer is int itself. At least in C++, where 3.9.1/2 of the standard says:

Plain ints have the natural size suggested by the architecture of the execution environment

I expect the same is true for C, though I don't have any of the standards documents.

Terr answered 12/9, 2010 at 8:26 Comment(4)
Thanks for the standard quote, but how do you explain LP64 x86_64?Cashbook
@Matt: I believe that 64-bit arithmetic on Intel x86_64 CPUs is slower than 32-bit arithmetic. My recollection is that AMD came out with a 64-bit version of the x86 general-purpose instruction set first, with 64-bit arithmetic that may or may not have been as fast as the 32-bit, and Intel had to catch up fast, so its 64-bit ops are internally implemented on the same 32-bit-wide data path as before and thus take twice as long as plain 32-bit ops. Can anyone confirm this?Terr
This is definitely the best answer if we can get this confirmed.Cashbook
@Matt: Might be worth searching SO questions for "64-bit" or similar... That's probably where I got it from. Sorry I don't have the time to do this myself right now.Terr

For all existing mainstream architectures long is the fastest type at present for loop throughput.

Cashbook answered 14/9, 2010 at 2:45 Comment(1)
The fastest standard type, maybe. Most mainstream architectures with multimedia support have vector extensions.Ralina

If you're compiling with gcc, I'd recommend using __builtin_ffs() for finding the first bit set:

Built-in Function: int __builtin_ffs (unsigned int x) Returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero.

This will be compiled into (often a single) native assembly instruction.
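
For example (an illustrative use; note the one-based return convention):

#include <stdio.h>

int main(void)
{
    printf("%d\n", __builtin_ffs(0));    // 0: no bit set
    printf("%d\n", __builtin_ffs(1));    // 1: bit 0 is the lowest set bit
    printf("%d\n", __builtin_ffs(0x50)); // 5: bit 4 is the lowest set bit
    return 0;
}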

Wesleyanism answered 12/9, 2010 at 20:18 Comment(1)
The example was only for illustration. I used ffsX to denote non-specific width ffs implementation.Cashbook

It is not possible to answer this question since the question is incomplete. As an analogy, consider the question:

What is the fastest vehicle?

A Bugatti Veyron? Certainly fast, but no good for going from London to New York.

What is missing from the question is the context the integer will be used in. In the original example above, I doubt you'd see much difference between 8, 32 or 64 bit values if the array is large and sparse, since you'll be hitting memory bandwidth limits before CPU limits.

The main point is, the architecture does not define what size the various integer types are, it's the compiler designer that does that. The designer will carefully weigh up the pros and cons for various sizes for each type for a given architecture and pick the most appropriate.

I guess a 32 bit int was kept on 64 bit systems because, for most of the operations int is used for, 32 bits are enough. Since memory bandwidth is a limiting factor, saving on memory use was probably the overriding factor.

Accouter answered 13/9, 2010 at 9:41 Comment(2)
Sure, the total throughput might be the same, but I need those cycles for other things.Cashbook
@Matt: I need those cycles for other things - hmmm, unless you're hand coding the assembler to interleave several tasks at the instruction level, those cycles are going to be wasted, the OS doesn't task switch at that frequency/granularity.Accouter

Let's read what the C17 ISO standard has to say about the "fast" types of stdint.h:

7.20.1.3 Fastest minimum-width integer types

1 Each of the following types designates an integer type that is usually fastest 266) to operate with among all integer types that have at least the specified width.


266) The designated type is not guaranteed to be fastest for all purposes; if the implementation has no clear grounds for choosing one type over another, it will simply pick some integer type satisfying the signedness and width requirements.

On the most widely used x86-64 architecture almost all integer instructions run comparably fast. Therefore compilers normally pick intX_t for int_fastX_t. Indeed, if you don't need more than X bits, why allocate more? In other words, the "speed" meant here is the type's internal CPU performance.

Your case, though, mostly reads memory and does very little CPU processing on it. For this purpose, using an integer type as wide as the memory bus makes more sense. Nowadays that's 64 bits, so int64_t is the type. The C standard doesn't have a type for "memory bus width" int, so if you are trying to write an architecture-independent solution, intmax_t is your best bet. Just make sure not to hit a buffer overflow.
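
As a sketch of that approach (illustrative only; the caller must guarantee 8-byte alignment and a length that is a multiple of 8 bytes), a bounded scan avoids the overflow hazard mentioned above:

#include <stddef.h>
#include <stdint.h>

// return the index of the first nonzero 64-bit word, or nwords if none
size_t first_nonzero_word64(uint64_t const *buf, size_t nwords)
{
    for (size_t i = 0; i < nwords; ++i)
        if (buf[i] != 0)
            return i;
    return nwords;
}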

Huynh answered 6/4 at 8:43 Comment(0)
