Why would uint32_t be preferred rather than uint_fast32_t?

It seems that uint32_t is much more prevalent than uint_fast32_t (I realise this is anecdotal evidence). That seems counter-intuitive to me, though.

Almost always when I see an implementation use uint32_t, all it really wants is an integer that can hold values up to 4,294,967,295 (usually a much lower bound somewhere between 65,535 and 4,294,967,295).

It seems weird to then use uint32_t, as the 'exactly 32 bits' guarantee is not needed, and the 'fastest available >= 32 bits' guarantee of uint_fast32_t seems to be exactly the right idea. Moreover, while it's usually implemented, uint32_t is not actually guaranteed to exist.

Why, then, would uint32_t be preferred? Is it simply better known or are there technical advantages over the other?

Wurth answered 26/10, 2017 at 16:46 Comment(14)
Simple answer, maybe they need an integer that has exactly 32 bits?Easygoing
First I've heard of uint32_fast_t, which if I'm understanding correctly, is at least 32 bits (meaning it could be more? Sounds misleading to me). I'm currently using uint32_t and friends on my project because I'm packing up this data and sending it over network, and I want the sender and receiver to know exactly how big the fields are. Sounds like this may not be the most robust solution since a platform may not implement uint32_t, but all of mine do apparently so I'm fine with what I'm doing.Griddle
@yano: For networking, you should also care about byte order/endianness - uint32_t doesn't give you that (and it's a pity there's no uint32_t_be and uint32_t_le, which would be more appropriate for almost every possible case where uint32_t is currently the best option).Attention
@Attention whoops, read your comment incorrectly, you're saying there are no _be and _le types. I agree, such types would be ideal for networking applications. I currently only have 2 target systems, and they're both little endian, so endianness hasn't been an issue and as such I've decided to brush it under the rug for the moment. Maybe that will come back to bite me later, but accounting for endianness shouldn't be a terrible amount of re/additional work.Griddle
On 64-bit platforms using gcc uint_fast32_t is likely defined as uint64_t. This can be a gotcha if you expect uint_fast32_t to behave like a 32-bit type, and also using fast types blindly and having a 64-bit type for every variable is likely to have negative performance characteristics.Photofinishing
@Attention Yes, that is usually where I see uint32_t - in some attempts at portability. I end up forcing a byte order by converting to a byte array before outputting them anyway, though. That was the main reason I figured the 'exactly 32 bits' requirement is often useless.Wurth
When I need some types that have at least N bits, I choose appropriately from short, int, long, long longVaporing
@Attention - with regards to _be and _le, would htonl() and ntohl() provide that same capability?Steadman
@mpez0: Sort of; but you can't put htonl() in a structure (e.g. like struct myPacketFormat { uint32_t_le sequenceNumber; ... } ) so you end up with htonl() and friends scattered everywhere (except for that one place where you forgot that takes you 4 days to find). ;-)Attention
@Attention that's a pretty heavyweight object to hide in a standard int, all of which are primitive types. I agree with you in principle that this should be handled in the standard somewhere, but I think this might not be the placeHeald
@Steadman No, htonl/ntohl convert to/from BE when you really want to convert to/from LE. All modern protocols use LE over the network because x86 won and LE can still be handled efficiently by BE processors (but not the other way around)Pippo
@Chuu, if using a 64-bit type has negative performance characteristics, then gcc would be wrong to use it as a definition for uint_fast32_t. Are you sure that both of your statements are accurate?Smegma
Realistically, it's because uint_fast32_t is more typing.University
Why is fast arithmetic important? Lots of programs don't do any significant amount of arithmetic.Liebknecht

uint32_t is guaranteed to have nearly the same properties on any platform that supports it.1

uint_fast32_t, in comparison, makes very few guarantees about how it behaves on different systems.

If you switch to a platform where uint_fast32_t has a different size, all code that uses uint_fast32_t has to be retested and validated. All stability assumptions are going to be out the window. The entire system is going to work differently.

When writing your code, you may not even have access to a system where uint_fast32_t isn't 32 bits in size.

uint32_t won't work differently (see footnote).

Correctness is more important than speed. Premature correctness is thus a better plan than premature optimization.

In the event I was writing code for systems where uint_fast32_t was 64 or more bits, I might test my code for both cases and use it. Barring both need and opportunity, doing so is a bad plan.

Finally, uint_fast32_t, when you are storing it for any length of time or number of instances, can be slower than uint32_t simply due to cache size issues and memory bandwidth. Today's computers are far more often memory-bound than CPU-bound, and uint_fast32_t could be faster in isolation but not after you account for memory overhead.


1 As @chux has noted in a comment, if unsigned is larger than uint32_t, arithmetic on uint32_t goes through the usual integer promotions, and if not, it stays as uint32_t. This can cause bugs. Nothing is ever perfect.
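A minimal sketch of that corner case (the helper names are made up; the platform where the plain version bites would have int wider than 32 bits, e.g. a 48-bit int):

#include <stdint.h>

/* Where int is 32 bits wide this is plain unsigned wraparound; where int is
 * wider than 32 bits, a and b promote to signed int and the multiply can
 * overflow: undefined behavior. */
uint32_t mul_mod_2_32(uint32_t a, uint32_t b)
{
    return a * b;
}

/* Multiplying by 1u first keeps the arithmetic unsigned on every platform
 * and so avoids the promotion trap. */
uint32_t mul_mod_2_32_safe(uint32_t a, uint32_t b)
{
    return (uint32_t)(1u * a * b);
}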

Outwardly answered 26/10, 2017 at 17:34 Comment(7)
"uint32_t is guaranteed to have the same properties on any platform that supports it." There is a corner problem when unsigned is wider than uint32_t and then uint32_t on one platform goes through the usual integer promotions and on another it does not. Yet with uint32_t these integer math problem are significantly reduced.Markswoman
@chux a corner case that can cause UB when multiplying, because promotion prefers signed int and signed integer overflow is UB.Hathaway
Although this answer is correct as far as it goes, it very much downplays the key details. In a nutshell, uint32_t is for where the exact details of the machine representation of the type are important, whereas uint_fast32_t is for where computational speed is most important, (un)signedness and minimum range are important, and details of representation are non-essential. There is also uint_least32_t for where (un)signedness and minimum range are most important, compactness is more important than speed, and exact representation is not essential.Lillis
@JohnBollinger Which is all well and good, but without testing on actual hardware that implements more than 1 variant, the variable size types are a trap. And the reason why people use uint32_t rather than the other types is because they usually don't have such hardware to do testing on. (The same is true of int32_t to a lesser extent, and even int and short).Outwardly
An example of the corner case: Let unsigned short==uint32_t and int==int48_t. If you compute something like (uint32_t)0xFFFFFFFF * (uint32_t)0xFFFFFFFF, then the operands are promoted to signed int and will trigger a signed integer overflow, which is undefined behavior. See this question.Mezzorilievo
Correctness is more important than speed, true. But types like uint32_t simply fail to be correct for public APIs where exact magic numbers like "32" are actually not in the specification (which should be the common case for most software developers, until something evil like an ABI or some wire format is concerned). Introducing fixed-width machine integers too early is a common bug of leaking the implementation details, at the cost of reducing portability and raising the risks of hiding bugs related to misinterpretation of the specification.Hymanhymen
Also note types like uint32_t are not only used for arithmetic operations; they can also serve other uses, notably bitmasks. In these cases, leaking the width too early can incur even more problems. Sadly, you have no choice, for languages simply provide things like uint32_t to do the dirty work. That said, uint32_fast_t or uint32_least_t are still more correct than uint32_t in many cases.Hymanhymen

Why do many people use uint32_t rather than uint32_fast_t?

Note: Mis-named uint32_fast_t should be uint_fast32_t.

uint32_t has a tighter specification than uint_fast32_t and so makes for more consistent functionality.


uint32_t pros:

  • Various algorithms specify this type. IMO - best reason to use.
  • Exact width and range known.
  • Arrays of this type incur no waste.
  • unsigned integer math with its overflow is more predictable.
  • Closer match in range and math of other languages' 32-bit types.
  • Never padded.

uint32_t cons:

  • Not always available (yet this is rare in 2018).
    E.g.: Platforms lacking 8/16/32-bit integers (9/18/36-bit, others).
    E.g.: Platforms using non-2's complement (e.g., the old Unisys 2200).

uint_fast32_t pros:

  • Always available.
    This always allows all platforms, new and old, to use fast/minimum types.
  • "Fastest" type that supports a 32-bit range.

uint_fast32_t cons:

  • Only the minimum range is known. For example, it could be a 64-bit type.
  • Arrays of this type may be wasteful in memory (see the sizeof sketch below).
  • All answers (mine too at first), the post and comments used the wrong name uint32_fast_t. Looks like many just don't need or use this type. We didn't even use the right name!
  • Padding possible - (rare).
  • In select cases, the "fastest" type may really be another type. So uint_fast32_t is only a 1st-order approximation.

In the end, what is best depends on the coding goal. Unless coding for very wide portability or some niche performance function, use uint32_t.
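To make the array point above concrete, a quick sizeof check (a sketch; the exact numbers depend on the platform):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t      exact[1000];
    uint_fast32_t fast[1000];
    /* On x86-64 glibc uint_fast32_t is 8 bytes, so the second array is twice
     * as large; on many other platforms both lines print the same number. */
    printf("uint32_t[1000]:      %zu bytes\n", sizeof exact);
    printf("uint_fast32_t[1000]: %zu bytes\n", sizeof fast);
    return 0;
}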


There is another issue when using these types that comes into play: their rank compared to int/unsigned

Presumably uint_fastN_t would be at least the rank of unsigned. This is not specified, but it is a definite and testable condition.

Thus, uintN_t is more likely than uint_fastN_t to be narrower than unsigned. This means that code using uintN_t math is more likely to be subject to integer promotions than code using uint_fastN_t, which matters for portability.

On this concern, uint_fastN_t has the portability advantage for select math operations.
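A rough compile-time probe of whether each type is narrower than unsigned (and so subject to the usual integer promotions), using only the standard limits macros; width stands in for rank here, which is close enough in practice:

#include <stdint.h>
#include <limits.h>

#if UINT_FAST32_MAX < UINT_MAX
#define UINT_FAST32_PROMOTES 1  /* narrower than unsigned: promotions apply */
#else
#define UINT_FAST32_PROMOTES 0
#endif

#if UINT32_MAX < UINT_MAX
#define UINT32_PROMOTES 1       /* same check for the exact-width type */
#else
#define UINT32_PROMOTES 0
#endif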


Side note about int32_t rather than int_fast32_t: On rare machines, INT_FAST32_MIN may be -2,147,483,647 and not -2,147,483,648. The larger point: (u)intN_t types are tightly specified and lead to portable code.

Markswoman answered 26/10, 2017 at 17:32 Comment(17)
Fastest type that support 32-bit range => really? This is a relic of a time when RAM was running at CPU speeds, nowadays the balance has shifted dramatically on PCs so (1) pulling 32-bits integers from memory is twice as fast as pulling 64-bits ones and (2) vectorized instructions on 32-bits integers crunch twice as many as they do on 64-bits ones. Is it still really the fastest?Exiguous
Fastest for some things, slower for other things. There's no one-size-fits-all answer to "what's the fastest size of integer" when you consider arrays vs. needing zero-extension. In the x86-64 System V ABI, uint32_fast_t is a 64-bit type, so it saves the occasional sign-extension and allows imul rax, [mem] instead of a separate zero-extending load instruction when using it with 64-bit integers or pointers. But that's all you get for the price of double the cache footprint and extra code-size (REX prefixed on everything.)Elbrus
Also, 64-bit division is much slower than 32-bit division on most x86 CPUs, and some (like Bulldozer-family, Atom, and Silvermont) have slower 64-bit multiply than 32. Bulldozer-family also has slower 64-bit popcnt. And remember, it's only safe to use this type for 32-bit values, because it's smaller on other architectures, so you're paying this cost for nothing.Elbrus
@PeterCordes comments well on "fastest type". The type behind uint32_fast_t certainly takes into account many things: such as processor characteristics, memory access, cache, vectorizing and the options selected at compile time. Of course a choice may not prove to be the fastest in every situation. In essence, it is the compiler's assessment and a likely good initial choice. For the select cases where the true fastest is needed requires code profiling.Markswoman
I would expect that as a weighted average over all C and C++ applications, making uint32_fast_t on x86 is a terrible choice. The operations that are faster are few and far between, and the benefit when they occur are mostly minuscule: the differences for the imul rax, [mem] case that @PeterCordes mentions are very, very small: a single uop in the fused domain and zero in the unfused domain. In most interesting scenarios it won't even add a single cycle. Balance that against double the memory use, and worse vectorization, it's hard to see it winning very often.Beaded
@chux CPU-tuning options don't change the ABI (i.e. -march=silvermont won't make uint32_fast_t a 32-bit type instead of 64-bit.) The #define is unsigned long int in stdint.h. Compilers need to agree with each other on the size if you want to link objects from different compilers (or the same compiler with different options!). It's not a compiler missed-optimization, it's a case of the ABI designers (i.e. compiler devs) choosing poorly (IMO) nearly 20 years ago. Too much weight on (scalar) instruction count, not enough on cache footprint and auto-vectorization.Elbrus
Actually, the fast_t types aren't specified in the x86-64 SysV psABI document. I think they're only established by glibc's stdint.h. Of course, gcc itself could take control of the type for whole-program optimization. That would actually be interesting if it really did give the compiler license to use 32 or 64 bit integers in different places (e.g. when used as a loop counter or return value vs. an array.) i.e. a 32-bit type that the compiler can widen instead of requiring it to wrap at 32 bits.Elbrus
@PeterCordes - interesting but also terrible :). It would make fast_t an even worse int: not only does it have different sizes on different platforms, but it would have different sizes depending on optimization decisions and different sizes in different files! As a practical matter, I think it can't work even with whole program optimization: sizes in C and C++ are fixed so sizeof(uint32_fast_t) or anything which determines it even directly has to always return the same value, so it would very tough for the compiler to make such a transformation.Beaded
Beyond that, it's not just the sizes in the SysV ABI that have to be respected by gcc: in general any type should have the same representation for code to be compatible. Since even LTO and FPO can't exclude the possibility of a dlopen at runtime (reasonably), it has to use a consistent representation mostly (again for things like locals that can't escape I guess it can do whatever it wants: but that's true of any type, including uint32_t). I guess that's the key point: for any use of a type that is provably "private" the compiler is already free to fully optimize.Beaded
So outside of arithmetic semantics, the concrete properties (size, mostly) of types are useful largely for guaranteed compatibility between stuff in different compilation units, and compilers have to make the same static decisions in that regard as they do for say int.Beaded
@BeeOnRope: If there was a type like that, the compiler would have to disallow it in any function signatures or data types exposed outside of LTO / whole-program optimization. You're right that C/C++ don't have anything remotely like that, but if they did it would not be appropriate (or hopefully not even legal) to use it for anything except local temporaries or static function args/returns and static arrays. for any use of a type that is provably "private"...: no, the compiler still has to zero-extend and implement wraparound at 32 bits, or use 64-bit ops always to avoid truncation.Elbrus
"Presumably uint_fastN_t would be at least the rank of unsigned." I do not think this is the case. On many, i assume most or even all, 8 bit CPUs is this true: sizeof(uint8_t)==sizeof(uint_least8_t) && sizeof(uint_least8_t)<sizeof(unsigned), because unsigned is at least 16 bits.Exemplify
@Exemplify Comment is unclear. "Presumably uint_fastN_t would be at least the rank of unsigned." refers to fast types, yet your comment relates to least types. sizeof(uint8_t)==sizeof(uint_least8_t) is certainly always true when uint8_t exist, yet the quote is about uint_fastN_t.Markswoman
@chux-ReinstateMonica On these platforms sizeof(uint8_t)==sizeof(uint_least8_t)&&sizeof(uint_least8_t)==sizeof(uint_fast8_t). On an 8-bit CPU a 16-bit int will always be slower than any of the 8-bit types.Exemplify
@Exemplify The issue in not about uint_least8_t. The issue and comment you posted as of concern is about uint_fastN_t such as uint_fast8_t. It remains unclear why your comments discuss uint_least8_t. uint_least8_t and uint_fast8_t are independently specified and need not be the same type.Markswoman
@chux-ReinstateMonica I mixed it up. The point is: uint_fast8_t is smaller than unsigned on many platforms.Exemplify
@Exemplify Agree uint_fast8_t could reasonably be smaller than unsigned. It could reasonably be the same as unsigned or something in between. It is a compiler choice.Markswoman

Why do many people use uint32_t rather than uint32_fast_t?

Silly answer:

  • There is no standard type uint32_fast_t, the correct spelling is uint_fast32_t.

Practical answer:

  • Many people actually use uint32_t or int32_t for their precise semantics, exactly 32 bits with unsigned wrap around arithmetic (uint32_t) or 2's complement representation (int32_t). The xxx_fast32_t types may be larger and thus inappropriate to store to binary files, use in packed arrays and structures, or send over a network. Furthermore, they may not even be faster.

Pragmatic answer:

  • Many people just don't know (or simply don't care) about uint_fast32_t, as demonstrated in comments and answers, and probably assume plain unsigned int to have the same semantics, although many current architectures still have 16-bit ints and some rare Museum samples have other strange int sizes less than 32.

UX answer:

  • Although possibly faster than uint32_t, uint_fast32_t is slower to use: it takes longer to type, especially accounting for looking up spelling and semantics in the C documentation ;-)

Elegance matters, (obviously opinion based):

  • uint32_t looks bad enough that many programmers prefer to define their own u32 or uint32 type... From this perspective, uint_fast32_t looks clumsy beyond repair. No surprise it sits on the bench with its friends uint_least32_t and such.
Mesosphere answered 26/10, 2017 at 18:50 Comment(1)
+1 for UX. It's better than std::reference_wrapper I guess, but sometimes I wonder if the standard committee really wants the types it standardizes to be used...Exiguous

One reason is that unsigned int is already "fastest" without the need for any special typedefs or the need to include something. So, if you need it fast, just use the fundamental int or unsigned int type.
While the standard does not explicitly guarantee that it is fastest, it indirectly does so by stating "Plain ints have the natural size suggested by the architecture of the execution environment" in 3.9.1. In other words, int (or its unsigned counterpart) is what the processor is most comfortable with.

Now of course, you don't know what size unsigned int might be. You only know it is at least as large as short (and I seem to remember that short must be at least 16 bits, although I can't find that in the standard now!). Usually it's simply 4 bytes, but it could in theory be larger, or in extreme cases, even smaller (although I've personally never encountered an architecture where this was the case, not even on 8-bit computers in the 1980s... maybe some microcontrollers, who knows; turns out I suffer from dementia, int was very clearly 16 bits back then).

The C++ standard doesn't bother to specify what the <cstdint> types are or what they guarantee, it merely mentions "same as in C".

uint32_t, per the C standard, guarantees that you get exactly 32 bits. Nothing different, nothing less, and no padding bits. Sometimes this is exactly what you need, and thus it is very valuable.

uint_least32_t guarantees that whatever the size is, it cannot be smaller than 32 bits (but it could very well be larger). Sometimes, but much more rarely than an exact width or "don't care", this is what you want.

Lastly, uint_fast32_t is somewhat superfluous in my opinion, except for documentation-of-intent purposes. The C standard states "designates an integer type that is usually fastest" (note the word "usually") and explicitly mentions that it needs not be fastest for all purposes. In other words, uint_fast32_t is just about the same as uint_least32_t, which is usually fastest too, only no guarantee given (but no guarantee either way).

Since most of the time you either don't care about the exact size or you want exactly 32 (or 64, sometimes 16) bits, and since the "don't care" unsigned int type is fastest anyway, this explains why uint_fast32_t isn't so frequently used.

Elam answered 26/10, 2017 at 20:1 Comment(6)
I'm surprised you don't remember 16-bit int on 8-bit processors, I can't remember any from those days that used anything larger. If memory serves, compilers for segmented x86 architecture used 16-bit int as well.Sweeping
@MarkRansom: Wow, you are right. I was sooooo convinced that int was 32 bits on the 68000 (which I thought of, as an example). It was not...Elam
int was meant to be the fastest type in the past with a minimal width of 16 bits (this is why C has the integer promotion rule), but today with 64-bit architectures this is not true anymore. For example, 8-byte integers are faster than 4-byte integers on x86_64 because with 4-byte integers the compiler has to insert an additional instruction that expands the 4-byte value into an 8-byte value before comparing it with other 8-byte values.Calumny
"unsigned int" is not necessarily fastest on x64. Weird things happened.Wilburnwilburt
Another common case is that long, for historical reasons, needs to be 32-bit, and int is now required to be no wider than long, so int might need to stay 32-bit even when 64 bits would be faster.Llamas
"In other words, int (or its unsigned counterpart) is what the processor is most comfortable with." That is just plain wrong. int is the size which is at least 16 bit and of this types the one which is best for the current architecture. 8 bit would be the most "natural" and fastest integer on all 8 bit CPUs, which probably still are the most sold processors by numbers, and there a unsigned is 16 bit.Exemplify

I have not seen evidence that uint32_t is used for its range. Instead, most of the time that I've seen uint32_t used, it is to hold exactly 4 octets of data in various algorithms, with guaranteed wraparound and shift semantics!

There are also other reasons to use uint32_t instead of uint_fast32_t: often it is that it provides a stable ABI. Additionally, the memory usage can be known accurately. This very much offsets whatever the speed gain would be from uint_fast32_t, whenever that type is distinct from uint32_t.
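A sketch of that use (the record format here is hypothetical): exact-width members give a layout whose field sizes and total size are the same on every platform that provides uint32_t, which is what file and wire layouts need. With uint_fast32_t the layout could change from one ABI to the next.

#include <stdint.h>

struct record_header {
    uint32_t magic;      /* exactly 4 octets */
    uint32_t length;     /* exactly 4 octets */
    uint32_t checksum;   /* exactly 4 octets */
};
/* Byte order still has to be handled separately, as the question's comments note. */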

For values < 65536, there is already a handy type; it is called unsigned int (unsigned short is required to have at least that range as well, but unsigned int is of the native word size). For values < 4294967296, there is another, called unsigned long.


And lastly, people do not use uint_fast32_t because it is annoyingly long to type and easy to mistype :D

Stoss answered 26/10, 2017 at 17:23 Comment(9)
@ikegami: you changed my intent with the short edit. int is presumably the fast one when it is distinct from short.Malloy
Your last sentence is completely wrong, then. Claiming that you should use unsigned int instead of uint16_fast_t means you claim to know better than the compiler.Orange
Also, my apologies for changing the intent of your text. That wasn't my intent.Orange
unsigned long is not a good choice if your platform has 64-bit longs and you only need numbers <2^32.Ionopause
@ikegami: The type "unsigned int" will always behave as an unsigned type, even when promoted. In this regard it is superior to both uint16_t and uint_fast16_t. If uint_fast16_t were more loosely specified than normal integer types, such that its range need not be consistent for objects whose addresses aren't taken, that could offer some performance benefits on platforms which perform 32-bit arithmetic internally but have a 16-bit data bus. The Standard does not allow for such flexibility, however.Precedent
@ikegami: The fact that uint16_t may promote to a signed type, and uint_fast16_t might do so as well would have been unlikely to matter until about ten years ago, when compiler writers decided that since the Standard doesn't require that a function like uint_fast16_t mul_mod_65535(uint_fast16_t x, uint_fast16_t y) { return (x*y) & 0xFFFF; } should behave sensibly for all values of x and y, even on 32-bit systems, they should "optimize" based on the fact that x*y will never exceed 2147483647.Precedent
@supercat, That's not relevant to what I said. There might be reasons to use an unsigned int, but Antii said one should use unsigned int instead of unsigned short because the former is faster. My comment was directed at that. If you simply want faster, you want uint_fast16_t. If there are other considerations at play, you may want something else, but that's not what Antii said.Orange
@ikegami: From a performance standpoint, there's unlikely to be any particular advantage to using individual objects of type uint_fast16_t versus unsigned; any decision to favor one or the other should thus be based on other criteria, not on a notion that one "knows better than the compiler". With regard to aggregates, however, programmers often will "know better", since it's common for individual 32-bit objects to be faster than 16-bit objects, but aggregates of 16-bit objects to be faster than aggregates of 32-bit objects.Precedent
@supercat, Re "any decision to favor one or the other should thus be based on other criteria", Again, irrelevant to what I said. You should tell that to Amitt instead, since they're the one that claimed you should decide base on speed.Orange

Several reasons.

  1. Many people don't know the 'fast' types exist.
  2. It's more verbose to type.
  3. It's harder to reason about your program's behaviour when you don't know the actual size of the type.
  4. The standard doesn't actually pin down "fastest", nor can it really: what type is actually fastest can be very context-dependent.
  5. I have seen no evidence of platform developers putting any thought into the size of these types when defining their platforms. For example on x86-64 Linux the "fast" types are all 64-bit even though x86-64 has hardware support for fast operations on 32-bit values.

In summary the "fast" types are worthless garbage. If you really need to figure out what type is fastest for a given application you need to benchmark your code on your compiler.

Disseminate answered 26/10, 2017 at 22:25 Comment(3)
Historically there have been processors that had 32-bit and/or 64-bit memory access instructions but not 8- & 16-bit. So int_fast{8,16}_t would have been not-quite-entirely-stupid 20+ years ago. AFAIK the last such mainstream processor was the original DEC Alpha 21064 (the second generation 21164 got improved). Probably there are still embedded DSPs or whatever that only do word accesses, but portability isn't normally a great concern on such things, so I don't see why you'd cargo-cult fast_t on those. And there were hand-built Cray "everything is 64-bit" machines.Insatiable
Category 1b: Many people don't care that the 'fast' types exist. That's my category.Ettaettari
Category 6: Many people don't trust that the 'fast' types are the fastest. I belong in that category.Enrollment

From the viewpoint of correctness and ease of coding, uint32_t has many advantages over uint_fast32_t in particular because of the more precisely defined size and arithmetic semantics, as many users above have pointed out.

What has perhaps been missed is that the one supposed advantage of uint_fast32_t - that it can be faster - just never materialized in any meaningful way. Most of the 64-bit processors that have dominated the 64-bit era (x86-64 and AArch64 mostly) evolved from 32-bit architectures and have fast 32-bit native operations even in 64-bit mode. So uint_fast32_t is just the same as uint32_t on those platforms.

Even if some of the "also-ran" platforms like POWER, MIPS64, SPARC only offer 64-bit ALU operations, the vast majority of interesting 32-bit operations can be done just fine on 64-bit registers: the bottom 32 bits will have the desired result (and all mainstream platforms at least allow you to load/store 32 bits). Right shift is the main problematic one, but even that can be optimized in many cases by value/range-tracking optimizations in the compiler.

I doubt the occasional slightly slower left shift or 32x32 -> 64 multiplication is going to outweigh double the memory use for such values, in all but the most obscure applications.

Finally, I'll note that while the tradeoff has largely been characterized as "memory use and vectorization potential" (in favor of uint32_t) versus instruction count/speed (in favor of uint_fast32_t) - even that isn't clear to me. Yes, on some platforms you'll need additional instructions for some 32-bit operations, but you'll also save some instructions because:

  • Using a smaller type often allows the compiler to cleverly combine adjacent operations by using one 64-bit operation to accomplish two 32-bit ones. An example of this type of "poor man's vectorization" is not uncommon. For example, creating a constant struct two32{ uint32_t a, b; } in rax, like two32{1, 2}, can be optimized into a single mov rax, 0x200000001 while the 64-bit version needs two instructions. In principle this should also be possible for adjacent arithmetic operations (same operation, different operand), but I haven't seen it in practice.
  • Lower "memory use" also often leads to fewer instructions, even if memory or cache footprint isn't a problem, because any type structure or arrays of this type are copied, you get twice the bang for your buck per register copied.
  • Smaller data types often make better use of modern calling conventions like the SysV ABI, which packs structure data efficiently into registers. For example, you can return up to a 16-byte structure in registers rdx:rax. For a function returning a structure with 4 uint32_t values (initialized from a constant), that translates into

    ret_constant32():
        movabs  rax, 8589934593
        movabs  rdx, 17179869187
        ret
    

    The same structure with 4 64-bit uint_fast32_t members needs a register move and four stores to memory to do the same thing (and the caller will probably have to read the values back from memory after the return):

    ret_constant64():
        mov     rax, rdi
        mov     QWORD PTR [rdi], 1
        mov     QWORD PTR [rdi+8], 2
        mov     QWORD PTR [rdi+16], 3
        mov     QWORD PTR [rdi+24], 4
        ret
    

    Similarly, when passing structure arguments, 32-bit values are packed about twice as densely into the registers available for parameters, so it makes it less likely that you'll run out of register arguments and have to spill to the stack1.

  • Even if you choose to use uint_fast32_t for places where "speed matters", you'll often also have places where you need a fixed-size type. For example, when passing values for external output, from external input, as part of your ABI, as part of a structure that needs a specific layout, or because you smartly use uint32_t for large aggregations of values to save on memory footprint. In the places where your uint_fast32_t and uint32_t types need to interface, you might find (in addition to the development complexity) unnecessary sign extensions or other size-mismatch related code. Compilers do an OK job at optimizing this away in many cases, but it is still not unusual to see this in optimized output when mixing types of different sizes.

You can play with some of the examples above and more on godbolt.
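For reference, the kind of source that could sit behind those two assembly listings (an assumption, not copied from the linked examples):

#include <stdint.h>

struct four32    { uint32_t      a, b, c, d; };  /* 16 bytes: returned in rdx:rax */
struct four_fast { uint_fast32_t a, b, c, d; };  /* 32 bytes on x86-64 glibc: returned via memory */

struct four32    ret_constant32(void) { return (struct four32){1, 2, 3, 4}; }
struct four_fast ret_constant64(void) { return (struct four_fast){1, 2, 3, 4}; }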


1 To be clear, the convention of packing structures tightly into registers isn't always a clear win for smaller values. It does mean that the smaller values may have to be "extracted" before they can be used. For example a simple function that returns the sum of the two structure members together needs a mov rax, rdi; shr rax, 32; add edi, eax while for the 64-bit version each argument gets its own register and just needs a single add or lea. Still if you accept that the "tightly pack structures while passing" design makes sense overall, then smaller values will take more advantage of this feature.

Beaded answered 28/10, 2017 at 5:9 Comment(21)
glibc on x86-64 Linux uses 64-bit uint_fast32_t, which is a mistake IMO. (Apparently uint_fast32_t is a 32-bit type on Windows.) Being 64-bit on x86-64 Linux is why I would never recommend that anyone use uint_fast32_t: it's optimized for low instruction count (function args and return values never need zero-extension for use as an array index), not for overall speed or code-size on one of the major important platforms.Elbrus
Oh right, I read your comment above about the SysV ABI, but as you point out later maybe it was a different group/document that decided it - but I guess once that happens it's pretty much set in stone. I think it is even questionable that pure cycle count/instruction count favors larger types even ignoring memory footprint effects and vectorization, even on platforms without good 32-bit operation support - because there are still cases where the compiler can optimize better the smaller types. I added some examples above. @PeterCordesBeaded
SysV packing multiple struct members into the same register costs more instructions fairly often when returning a pair<int,bool> or pair<int,int>. If both members aren't compile-time constants, there's usually more than just an OR, and the caller has to unpack the return values. (bugs.llvm.org/show_bug.cgi?id=34840 LLVM optimizes the return-value passing for private functions, and should treat 32-bit int as taking the whole rax so the bool is separate in dl instead of needing a 64-bit constant to test it.)Elbrus
IDK how it comes out on average. Obviously in any given case, a 32 or 64-bit might lead to better code depending on the fine details, and using different types for temporaries vs. arrays can help, too. It's much easier to get this optimal when writing by hand in asm because you can just decide that out-of-range values don't need to be handled. But for C you always have to pick something and then the compiler has to make code that works for any possible value of that type. If you really want to optimize, you need to use different types at different places to store/pass the same value.Elbrus
@PeterCordes - yes, I mentioned that the tighter packing isn't always a win in footnote 1, although I was talking about parameter passing (but the exact same thing applies for return values). As I said: if you think the SysV packing is smart idea then presumably it works out better. I have no idea if SysV packing is actually better though. They could have defined it to be smart: if your arguments will fit in the six registers "unpacked" (i.e., one value/member per reg), then do that. If not, fall back to the existing algorithm.Beaded
This only uses the stack in the same cases as today, but when there are 6 or less values they'll be passed in regs unpacked, which probably overall generates better code. Win win. Of course, it's a weird convention that is incompatible with varargs, and weird effects like passing the 7th argument totally changing the packing: not sure if allowing "backwards compatible" functions where new code calls with N+1 arguments an old function that is only expecting N and then the callee just ignores the extra arg - cause this would break that.Beaded
Overall my point is that even the purported benefit of 64-bit values: better code generation, to be traded off against more memory use is far from assured. In fact, more than the calling convention stuff (you'd expect most small functions where it matters to be inlined) I think things like the inline code generated for struct copying is a pretty big deal and may outweigh the other stuff. So at a pure codegen level I think you can say it might be close to a tie. Add memory footprint and IMO it's not close on average. Yes, you can do even better by selectively choosing between the two...Beaded
It's a win in some cases, but not others. It's great if the receiving end will eventually just store it somewhere in memory (after it figures out where), if the struct layout has no padding other than trailing. For a return value, passing a pointer for the callee to do the store might be even better, but that changes the func signature (and isn't what you want if you do want to use the values). As usual for performance, the best choice depends on context. I think it's a reasonable ABI choice, and just sucks for some structs. If you want to use the low 32-bit member first, great.Elbrus
I wish I had data for SPECint performance (and any relevant corner cases in other common software) with ABI changes like passing structs in registers using their memory layout (including padding). Or returning them in up to 4 registers. Or as you suggested, with each member in a separate register. (Agreed that the fallback to packing when you run out of regs is not feasible for a real ABI. Although hopefully not much real code depends on calling non-prototyped functions that aren't actually defined as variadic.) IDK how to change the ABI in gcc / clang, or if they'd make optimal choices.Elbrus
Not that it would be plausible to adopt a new ABI at this point for x86-64, but I think x86-64 SysV is generally a good ABI, and probably a better design than MS's (especially the default __fastcall instead of __vectorcall). But I'd like to know if my intuition is correct, because I haven't measured! Presumably in real code (tiny functions rare after inlining, unless you build without LTO...) the shadow space gets used for spills and it doesn't generally lead to stack wastage. Having 10 call-preserved XMM regs is far too much, I think, but 0 is too few. 1 or 2 would be nice for structsElbrus
Yes SysV seems pretty reasonable and the MS ones seem crap. Shadow space seems strictly worse than redzone, but maybe they have some technical constraint (e.g., like the dumb thing that ended up with legacy xmm operations in AVX not clearing the high part of the ymm reg that we'll be stuck with for 50 years). Probably the static calling conventions are not super important, since so much is subject to inlining and IPA which removes the constraints: at least if you have a case where it is important and own both parts of the code you can often do something about it...Beaded
...such as forcing struct members into regs, passing pointers or whatever is the best in your case. The main concern is probably where you don't own code on one side of the call interface (usually the callee) and there are small functions that are performance critical. That probably doesn't happen too much - the biggest case would probably be C standard library functions, but compilers have been intrinsifying all the interesting ones. C++ stdlib stuff is largely all in the headers.Beaded
That's good in theory, but does their compiler (or anyone elses) really do aggressive calling-convention IPA for private functions? Clang does a little bit (omitting struct members the caller doesn't use, even for packed-register), but IDK if it ever goes much further (like using rcx as a 3rd return slot, or returning non-trivially-copyable objects in registers if the copy-constructor doesn't do anything important.) BTW, the one case where shadow-space is good is for varargs. MS's ABI is optimized for variadic functions with shadow space and int/fp competing for reg-arg slots...Elbrus
@PeterCordes What's good in theory? IPA and inlining? Inlining definitely works to remove all the constraints of the calling convention, and IPA is less important here since if a function is not inlined, it is probably not small, so the relative effect of the calling convention is probably small, but I mentioned it to be complete because I know that IPA is used to optimize some things outside of the bounds of the calling convention. I haven't heard of using extra regs to return values, or changing from caller storage + rax pointer to reg return, but haven't tested either.Beaded
Good point that inlining usually makes it not a big factor. Except with small callbacks I guess. C++ templates have the advantage over libc qsort there. Many Linux distros (and build scripts in source) don't enable LTO or PGO, unfortunately, which can be really bad for projects where coding standards discourage inline definitions in headers. I meant IPA to overcome inefficiencies in standard calling conventions is good in theory, but gcc and clang don't do it aggressively for SysV. (Maybe impossible to test Windows ABI on Godbolt, because __attribute__((ms_abi)) would probably ).Elbrus
Well I guess it depends on your point of view. gcc certainly claims to have some pretty aggressive IPA optimizations (read the description of the -fipa-* flags), which includes things like totally dropping parameters if unused (which you mentioned) or if always called with a constant, and even compiling two (or more?) versions of a function if it needs a general one and a special one where args are constants. It can change reference params to pass by value, it can use callee-clobbered regs across a call anyways if it knows they aren't clobbered, etc.Beaded
So it doesn't necessary exact address the "calling convention" cases we were discussing like "better to pack two 32-bit values in a reg or pass in two regs" and similar, but it's magic that goes even beyond what a calling convention could do in many cases. If the call/return path is really hot and the function is not inlined, I guess it could be because the function is big (statically) but dynamically most of the function is not executed. There you could peel out the hot part into another function and have an explicit slowpath. I don't know if compilers can yet do anything similar.Beaded
I think choosing a custom calling convention that trades off optimality considering all callers is a harder problem than just finding function parameters or return values that no callers use. The search space of possible calling conventions is large. At that point you're kind of turning 3 functions (2 callers + 1 common callee) into a giant shared function with 2 entry points. And worse when the callee isn't a leaf function. Anyway yes powerful IPA optimization is a thing, but customizing the calling convention isn't something it does very much.Elbrus
Yes, probably. My claim is not that it's an easier problem, but that the other IPA optimizations are "more powerful" in terms of their effect. Messing with the calling convention has a small bounded benefit per call since the "worst" case is not actually that bad. Doing IPA that lets you recompile a function totally differently based on knowledge about the call sites has unbounded potential. Note that our last two replies raced (in case you didn't see the one right above yours, and addresses a bit the question of "calling convention vs non-calling convention IPA".Beaded
I think compilers generally don't split functions. Peeling out a fast-path as a separate function is a useful source-level optimization (especially in a header where it can inline). Can be very good if 90% of inputs are the "do nothing case"; doing that filtering in the caller's loop is a big win. IIRC, Linux uses __attribute__((noinline)) exactly to make sure that gcc doesn't inline the error-handling function and put a bunch of push rbx / ... / pop rbx / ... on the fast path of some important kernel functions that have many callers and don't themselves inline.Elbrus
In Java it's really important too because inlining is so key to further optimizations (especially de-virtualization which is pervasive unlike C++), so it often pays to split out a fast-path there, and "bytecode optimization" is actually a thing (despite the conventional wisdom that it makes no sense because the JIT does the final compile) just to get the bytecode count down since inlining decisions are based on bytecode size, not inlined machine code size (and the correlation can vary by orders of magnitude).Beaded

To my understanding, int was initially supposed to be a "native" integer type with additional guarantee that it should be at least 16 bits in size - something that was considered "reasonable" size back then.

When 32-bit platforms became more common, we can say that "reasonable" size has changed to 32 bits:

  • Modern Windows uses 32-bit int on all platforms.
  • POSIX guarantees that int is at least 32 bits.
  • C# and Java have a type int which is guaranteed to be exactly 32 bits.

But when 64-bit platforms became the norm, no one expanded int to be a 64-bit integer because of:

  • Portability: a lot of code depends on int being 32 bit in size.
  • Memory consumption: doubling memory usage for every int might be unreasonable for most cases, as in most cases numbers in use are much smaller than 2 billion.

Now, why would you prefer uint32_t to uint_fast32_t? For the same reason languages like C# and Java always use fixed-size integers: programmers do not write code thinking about possible sizes of different types; they write for one platform and test code on that platform. Most of the code implicitly depends on specific sizes of data types. And this is why uint32_t is a better choice for most cases - it does not allow any ambiguity regarding its behavior.

Moreover, is uint_fast32_t really the fastest type with a size equal to or greater than 32 bits on a given platform? Not really. Consider this code compiled by GCC for x86_64 on Windows:

extern uint64_t get(void);

uint64_t sum(uint64_t value)
{
    return value + get();
}

Generated assembly looks like this:

push   %rbx
sub    $0x20,%rsp
mov    %rcx,%rbx
callq  d <sum+0xd>
add    %rbx,%rax
add    $0x20,%rsp
pop    %rbx
retq

Now if you change get()'s return value to uint_fast32_t (which is 4 bytes on Windows x86_64) you get this:

push   %rbx
sub    $0x20,%rsp
mov    %rcx,%rbx
callq  d <sum+0xd>
mov    %eax,%eax        ; <-- additional instruction
add    %rbx,%rax
add    $0x20,%rsp
pop    %rbx
retq

Notice how generated code is almost the same except for additional mov %eax,%eax instruction after function call which is meant to expand 32-bit value into a 64-bit value.

There is no such issue if you only use 32-bit values, but you will probably be using those with size_t variables (array sizes probably?) and those are 64 bits on x86_64. On Linux uint_fast32_t is 8 bytes, so the situation is different.

Many programmers use int when they need to return a small value (let's say in the range [-32,32]). This would work perfectly if int were the platform's native integer size, but since it is not on 64-bit platforms, another type which matches the platform's native type is a better choice (unless it is frequently used with other integers of smaller size).

Basically, regardless of what the standard says, uint_fast32_t is broken on some implementations anyway. If you care about the additional instructions generated in some places, you should define your own "native" integer type. Or you can use size_t for this purpose, as it will usually match the native size (I am not including old and obscure platforms like 8086, only platforms that can run Windows, Linux etc).
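One hedged way to do that (the name unative_t and the UINTPTR_MAX test are illustrative choices, not a standard idiom):

#include <stdint.h>

#if UINTPTR_MAX > 0xFFFFFFFFu
typedef uint64_t unative_t;   /* 64-bit targets */
#else
typedef uint32_t unative_t;   /* 32-bit targets */
#endif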


Another sign that shows int was supposed to be a native integer type is the "integer promotion rule". Most CPUs can only perform operations on their native size, so a 32-bit CPU usually can only do 32-bit additions, subtractions etc (Intel CPUs are an exception here). Integer types of other sizes are supported only through load and store instructions. For example, an 8-bit value should be loaded with an appropriate "load 8-bit signed" or "load 8-bit unsigned" instruction and will be expanded to 32 bits after the load. Without the integer promotion rule C compilers would have to add a little bit more code for expressions that use types smaller than the native type. Unfortunately, this does not hold anymore with 64-bit architectures, as compilers now have to emit additional instructions in some cases (as was shown above).

Calumny answered 27/10, 2017 at 11:31 Comment(10)
Thoughts about "no one expanded int to be 64 bit integer because" and "Unfortunately, this does not hold anymore with 64 bit architectures" are very good points . To be fair though about "fastest" and comparing assembly code: In this case it appears the 2nd code snippet is slower with its extra instruction, yet code length and speed sometimes are not so well correlated. A stronger compare would report the run times - yet that is not so easy to do.Markswoman
I don't think it will be easy to measure the slowness of the 2nd code; Intel CPUs might be doing a really good job, but longer code means larger cache pollution too. A single instruction once in a while probably does not hurt, but the usefulness of uint_fast32_t becomes ambiguous.Calumny
I strongly agree that the usefulness of uint_fast32_t becomes ambiguous, in all but very select circumstances. I suspect the driving reason for uint_fastN_t at all is to accommodate "let's not use unsigned as 64-bit, even though it is often fastest on the new platform, because too much code will break" but "I still want a fast at-least-N-bit type." I'd UV you again if I could.Markswoman
Most 64-bit architectures can easily operate on 32-bit integers. Even DEC Alpha (which was a brand-new 64-bit architecture rather than an extension to an existing 32-bit ISA like PowerPC64 or MIPS64) had 32 and 64-bit loads/stores. (But not byte or 16-bit loads/stores!). Most instructions were 64-bit only, but it had native HW support for 32-bit add/sub and multiply that truncate the result to 32 bits. (alasir.com/articles/alpha_history/press/alpha_intro.html) So there'd be almost no speed gain from making int 64 bit, and usually a speed loss from cache footprint.Insatiable
Also, if you made int 64-bit, your uint32_t fixed-width typedef would need an __attribute__ or other hack, or some custom type that's smaller than int. (Or short, but then you have the same problem for uint16_t.) Nobody wants that. 32-bit is wide enough for almost everything (unlike 16-bit); using 32-bit integers when that's all you need is not "inefficient" in any meaningful way on a 64-bit machine.Elbrus
@PeterCordes You missed a few sentences. Loading words of non-native size is not the same as actually operating on them. Most CPUs extend the value to a native size. As for speed gain, the post explicitly shows how the compiler has to emit additional instructions on a 64-bit CPU. As for cache footprint, it is irrelevant if you need a datatype to use a function return value. As my example shows, int (or int32_t) has an overhead of an additional instruction. And this is why uint_fast32_t is 8 bytes on x86_64 on Linux. Because it is faster.Calumny
As for sizes of native data types - they could have added short short which would be similar to long long, but small. The only reason why int was not expanded is backward compatibility. Leaving it as it is doesn't really break anything; changing it does. The only thing it breaks is C's "integer promotion rule" which now forces compilers to generate suboptimal code in some cases. But the generated code is, again, not that bad.Calumny
It saves an occasional instruction at call/return boundaries so it's faster in some cases. But it's slower in other cases: multiply throughput/latency on Bulldozer-family and on Silvermont/KNL, and for throughput with SIMD (half the elements per vector). It also requires REX prefixes which increase code size (and thus indirectly slow you down). Not to mention the cache footprint cost for storing them in memory (structs / arrays)Elbrus
Fair point, you could have 32-bit short and 16-bit short short.Elbrus
You get a prefix, but you don't get additional instructions. As for memory occupied by uint_fast32_t, I think it is irrelevant. You should be using all bytes of uint_fast32_t. So if it is 8 bytes, you are not wasting memory, as you otherwise would use two uint32_t. If you are using an 8-byte uint_fast32_t to store 4 bytes of data, why not just use uint32_t? As for vectorization, I don't see a use for uint_fast32_t there. IMO you know the exact size of the integers you are working with when you do something like that. And this is very CPU specific anyway.Calumny

For practical purposes, uint_fast32_t is completely useless. It's defined incorrectly on the most widespread platform (x86_64), and doesn't really offer any advantages anywhere unless you have a very low-quality compiler. Conceptually, it never makes sense to use the "fast" types in data structures/arrays - any savings you get from the type being more efficient to operate on will be dwarfed by the cost (cache misses, etc.) of increasing the size of your working data set. And for individual local variables (loop counters, temps, etc.) a non-toy compiler can usually just work with a larger type in the generated code if that's more efficient, and only truncate to the nominal size when necessary for correctness (and with signed types, it's never necessary).

The one variant that is theoretically useful is uint_least32_t, for when you need to be able to store any 32-bit value, but want to be portable to machines that lack an exact-size 32-bit type. Practically speaking, however, that's not something you need to worry about.

Mortise answered 29/10, 2017 at 0:46 Comment(0)

In many cases, when an algorithm works on an array of data, the best way to improve performance is to minimize the number of cache misses. The smaller each element, the more of them can fit into the cache. This is why a lot of code is still written to use 32-bit pointers on 64-bit machines: they don’t need anything close to 4 GiB of data, but the cost of making all pointers and offsets need eight bytes instead of four would be substantial.

There are also some ABIs and protocols specified to need exactly 32 bits, for example, IPv4 addresses. That’s what uint32_t really means: use exactly 32 bits, regardless of whether that’s efficient on the CPU or not. These used to be declared as long or unsigned long, which caused a lot of problems during the 64-bit transition. If you just need an unsigned type that holds numbers up to at least 2³²-1, that’s been the definition of unsigned long since the first C standard came out. In practice, though, enough old code assumed that a long could hold any pointer or file offset or timestamp, and enough old code assumed that it was exactly 32 bits wide, that compilers can’t necessarily make long the same as int_fast32_t without breaking too much stuff.

In theory, it would be more future-proof for a program to use uint_least32_t, and maybe even load uint_least32_t elements into a uint_fast32_t variable for calculations. An implementation that had no uint32_t type at all could even declare itself in formal compliance with the standard! (It just wouldn’t be able to compile many existing programs.) In practice, there’s no architecture any more where int, uint32_t, and uint_least32_t are not the same, and no advantage, currently, to the performance of uint_fast32_t. So why overcomplicate things?
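A minimal sketch of that storage-vs-calculation split (the function and parameter names are illustrative):

#include <stdint.h>
#include <stddef.h>

uint_fast32_t sum_mod_2_32(const uint_least32_t *v, size_t n)
{
    uint_fast32_t acc = 0;                           /* "fast" type for the running value */
    for (size_t i = 0; i < n; i++)
        acc = (acc + v[i]) & UINT32_C(0xFFFFFFFF);   /* keep the 32-bit wraparound explicit */
    return acc;                                      /* elements stay as compact uint_least32_t */
}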

Yet look at the reason all the 32_t types needed to exist when we already had long, and you’ll see that those assumptions have blown up in our faces before. Your code might well end up running someday on a machine where exact-width 32-bit calculations are slower than the native word size, and you would have been better off using uint_least32_t for storage and uint_fast32_t for calculation religiously. Or if you’ll cross that bridge when you get to it and just want something simple, there’s unsigned long.

Llamas answered 27/10, 2017 at 1:31 Comment(3)
But there are architectures where int is not 32 bits, for example ILP64. Not that they're common.Malloy
I don’t think ILP64 exists in the present tense? Several webpages claim that “Cray” uses it, all of which cite the same Unix.org page from 1997, but UNICOS in the mid-’90s actually did something weirder and today’s Crays use Intel hardware. That same page claims that ETA supercomputers used ILP64, but they went out of business a long time ago. Wikipedia claims that HAL’s port of Solaris to SPARC64 used ILP64, but they’ve also been out of business for years. CppReference says that ILP64 was only used in a few early 64-bit Unices. So it’s relevant only to some very esoteric retrocomputing.Llamas
Note that if you use the “ILP64 interface” of Intel’s Math Kernel Library today, int will be 32 bits wide. The type MKL_INT is what will change.Llamas

To give a direct answer: I think the real reason why uint32_t is used over uint_fast32_t or uint_least32_t is simply that it is easier to type, and, due to being shorter, much nicer to read: If you make structs with some types, and some of them are uint_fast32_t or similar, then it's often hard to align them nicely with int or bool or other types in C, which are quite short (case in point: char vs. character). I of course cannot back this up with hard data, but the other answers can only guess at the reason as well.

As for technical reasons to prefer uint32_t, I don't think there are any - when you absolutely need an exact 32 bit unsigned int, then this type is your only standardised choice. In almost all other cases, the other variants are technically preferable - specifically, uint_fast32_t if you are concerned about speed, and uint_least32_t if you are concerned about storage space. Using uint32_t in either of these cases risks not being able to compile as the type is not required to exist.

In practice, the uint32_t and related types exist on all current platforms, except some very rare (nowadays) DSPs or joke implementations, so there is little actual risk in using the exact type. Similarly, while you can run into speed penalties with the fixed-width types, they are (on modern CPUs) not crippling anymore.

Which is why, I think, the shorter type simply wins out in most cases, due to programmer laziness.

Rockefeller answered 30/7, 2018 at 7:9 Comment(3)
Well, there are some inconsistencies... or tradeoffs. If ease of typing is that important, why not u32 instead of uint32_t? I don't know whether it is more difficult to read...Hymanhymen
There is no u32 type in C, so it is not an option. Neither is it an option in the question. However, a lot of code does define an U32 type, presumably, again, for ease of typing at the cost of maintaining a header of a typedef somewhere.Rockefeller
Since u32 is not a reserved identifier or keyword, you can always use typedef uint32_t u32;, at the cost that the declaration may clash with others' code. If you don't want the cost of maintaining the declaration, propose it to JTC1/SC22/WG14, just like the pre-C99 era, when there was also no uint32_t. It is unlikely the pure synonym would be adopted, though. IIRC u32 was already popular decades ago, so the choice to put uint32_t instead of u32 into the standard seems also relevant here.Hymanhymen
