Performance of built-in types: char vs. short vs. int vs. float vs. double

Seeing Alexandre C's reply in the other topic, I'm curious to know whether there is any performance difference between the built-in types:

char vs. short vs. int vs. float vs. double.

Usually we don't consider such performance differences (if any) in our real-life projects, but I would like to know this for educational purposes. The general questions that can be asked are:

  • Is there any performance difference between integer arithmetic and floating-point arithmetic?

  • Which is faster? What makes it faster? Please explain.

Householder answered 21/2, 2011 at 18:2 Comment(5)
Profile, and measure. Use very large quantities of iterations.Benz
@Thomas Matthews: That can answer my one question: which is faster. But not "why it is faster".Householder
Plus of course, integer types and floating point types are good for very different things. I can think of few situations where I would consider both acceptable.Gaunt
@achelper If you are programming for a device without a FPU then it can be worthwhile to sacrifice accuracy and programmer time to convert an algorithm from floating point to integer (with appropriate scale factors).Rubetta
For what it's worth: I wrote a program in C with a number of arithmetic operations and a number of memory swaps (too complex to post here). I compiled it with different data types, and the differences are very small. On my Intel Coffee Lake I got these execution times for 6.2 billion (!) function calls on a single thread, in seconds (all signed): char: 12.76s, short: 12.31s, int: 11.87s, long: 11.86s, long long: 11.60s. All operations on the stack. I believe the size of the data moved to and from memory will have much more impact on performance than the CPU working with the data.Hollins
137

Float vs. integer:

Historically, floating-point could be much slower than integer arithmetic. On modern computers, this is no longer really the case (it is somewhat slower on some platforms, but unless you write perfect code and optimize for every cycle, the difference will be swamped by the other inefficiencies in your code).

On somewhat limited processors, like those in high-end cell phones, floating-point may be somewhat slower than integer, but it's generally within an order of magnitude (or better), so long as there is hardware floating-point available. It's worth noting that this gap is closing pretty rapidly as cell phones are called on to run more and more general computing workloads.

On very limited processors (cheap cell phones and your toaster), there is generally no floating-point hardware, so floating-point operations need to be emulated in software. This is slow -- a couple orders of magnitude slower than integer arithmetic.
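
This is also why code for such processors often avoids floating point entirely in favour of integer fixed-point arithmetic. Here is a minimal sketch of the idea; the Q16.16 format and the helper names are illustrative choices of mine, not from any particular library:

#include <cstdint>
#include <cstdio>

// Q16.16 fixed point: 16 integer bits, 16 fractional bits, stored in int32_t.
using fix16 = int32_t;

// The double math here is for demonstration; with constant arguments it
// folds away at compile time, so no FP runs on the target.
constexpr fix16 to_fix(double v)    { return static_cast<fix16>(v * 65536.0); }
constexpr double to_double(fix16 v) { return v / 65536.0; }

// Multiply in a wider type, then shift back down into Q16.16.
// (Right-shifting a negative value is implementation-defined before C++20,
// but behaves as expected on typical two's-complement targets.)
constexpr fix16 fix_mul(fix16 a, fix16 b)
{
    return static_cast<fix16>((static_cast<int64_t>(a) * b) >> 16);
}

int main()
{
    fix16 x = to_fix(1.5);
    fix16 y = to_fix(2.25);
    // Plain integer arithmetic throughout - no FPU and no emulation library.
    std::printf("%f\n", to_double(fix_mul(x, y)));   // prints 3.375000
}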

As I said though, people are expecting their phones and other devices to behave more and more like "real computers", and hardware designers are rapidly beefing up FPUs to meet that demand. Unless you're chasing every last cycle, or you're writing code for very limited CPUs that have little or no floating-point support, the performance distinction doesn't matter to you.

Different size integer types:

Typically, CPUs are fastest at operating on integers of their native word size (with some caveats about 64-bit systems). 32-bit operations are often faster than 8- or 16-bit operations on modern CPUs, but this varies quite a bit between architectures. Also, remember that you can't consider the speed of a CPU in isolation; it's part of a complex system. Even if operating on 16-bit numbers is 2x slower than operating on 32-bit numbers, you can fit twice as much data into the cache hierarchy when you represent it with 16-bit numbers instead of 32-bit ones. If that makes the difference between having all your data come from cache instead of taking frequent cache misses, then the faster memory access will trump the slower operation of the CPU.
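
As a minimal sketch of that cache-footprint effect (the numbers you get depend entirely on your CPU, compiler, optimization flags, and whether either loop vectorizes):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <vector>

// Sum a large array stored as 16-bit vs. 32-bit integers. The 16-bit array
// occupies half the memory, so twice as much of it fits in each cache level;
// for memory-bound loops that can matter more than the ALU operation width.
template <typename T>
void timed_sum(const char* label, const std::vector<T>& v)
{
    auto t0 = std::chrono::steady_clock::now();
    int64_t s = std::accumulate(v.begin(), v.end(), int64_t{0});
    auto t1 = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("%s: sum=%lld in %lld us\n", label, (long long)s, (long long)us);
}

int main()
{
    const std::size_t n = std::size_t{1} << 24;   // 16M elements: larger than any cache
    std::vector<int16_t> a(n, 1);                 // 32 MB
    std::vector<int32_t> b(n, 1);                 // 64 MB
    timed_sum("int16_t", a);
    timed_sum("int32_t", b);
}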

Other notes:

Vectorization tips the balance further in favor of narrower types (float and 8- and 16-bit integers) -- you can do more operations in a vector of the same width. However, good vector code is hard to write, so it's not as though you get this benefit without a lot of careful work.

Why are there performance differences?

There are really only two factors that affect whether or not an operation is fast on a CPU: the circuit complexity of the operation, and user demand for the operation to be fast.

(Within reason) any operation can be made fast, if the chip designers are willing to throw enough transistors at the problem. But transistors cost money (or rather, using lots of transistors makes your chip larger, which means you get fewer chips per wafer and lower yields, which costs money), so chip designers have to balance how much complexity to use for which operations, and they do this based on (perceived) user demand. Roughly, you might think of breaking operations into four categories:

                 high demand            low demand
high complexity  FP add, multiply       division
low complexity   integer add            popcount, hcf
                 boolean ops, shifts

high-demand, low-complexity operations will be fast on nearly any CPU: they're the low-hanging fruit, and confer maximum user benefit per transistor.

high-demand, high-complexity operations will be fast on expensive CPUs (like those used in computers), because users are willing to pay for them. You're probably not willing to pay an extra $3 for your toaster to have a fast FP multiply, however, so cheap CPUs will skimp on these instructions.

low-demand, high-complexity operations will generally be slow on nearly all processors; there just isn't enough benefit to justify the cost.

low-demand, low-complexity operations will be fast if someone bothers to think about them, and non-existent otherwise.

Further reading:

  • Agner Fog maintains a nice website with lots of discussion of low-level performance details (and has very scientific data collection methodology to back it up).
  • The Intel® 64 and IA-32 Architectures Optimization Reference Manual (PDF download link is part way down the page) covers a lot of these issues as well, though it is focused on one specific family of architectures.
Goglet answered 21/2, 2011 at 18:17 Comment(5)
It still is much slower (for most math operations -- e.g. exclude MOV, etc) when talking about the op-code timings/throughput in isolation. I can't find the good empirical PDF I used to have though :(Knickerbocker
I like your complexity/demand table. It's really an excellent way to summarize it. +1Cheslie
@pst: only if you consider latency; throughput is a more meaningful measure, and a modern non-embedded CPU can do (at least) one FP multiply and add every cycle.Goglet
+1 Very true -- I was trying to emphasize that point, but you have done so better even if it doesn't read as direct.Knickerbocker
Terrific answer! Very well written and one of the best answers I've ever read on that topic. Even the links are great.Cletis
15

Absolutely.

First, of course, it depends entirely on the CPU architecture in question.

However, integral and floating-point types are handled very differently, so the following is nearly always the case:

  • for simple operations, integral types are fast. For example, integer addition often has only a single cycle's latency, and integer multiplication is typically around 2-4 cycles, IIRC.
  • Floating-point types used to perform much slower. On today's CPUs, however, they have excellent throughput, and each floating-point unit can usually retire an operation per cycle, leading to the same (or similar) throughput as for integer operations. However, latency is generally worse: floating-point addition often has a latency of around 4 cycles (vs. 1 for ints). The latency/throughput distinction is sketched in code after this list.
  • for some complex operations, the situation is different, or even reversed. For example, division on FP may have less latency than for integers, simply because the operation is complex to implement in both cases, but it is more commonly useful on FP values, so more effort (and transistors) may be spent optimizing that case.
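
A minimal sketch of that latency vs. throughput distinction (cycle counts are ballpark figures, and note that compilers may not reassociate FP math on their own, so without flags like -ffast-math the single-accumulator loop stays a serial dependency chain):

#include <cstddef>

// Latency-bound: every add depends on the previous one, so the loop runs
// at roughly one add per FP-add latency (~4 cycles on many cores).
double sum_one_chain(const double* x, std::size_t n)
{
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

// Throughput-bound: four independent accumulators keep the FP adder busy,
// often approaching one add per cycle with the same instruction count.
double sum_four_chains(const double* x, std::size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i)
        s0 += x[i];    // handle the tail
    return (s0 + s1) + (s2 + s3);
}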

On some CPUs, doubles may be significantly slower than floats. On some architectures, there is no dedicated hardware for doubles, and so they are handled by passing two float-sized chunks through, giving you worse throughput and twice the latency. On others (the x86 FPU, for example), both types are converted to the same internal format (80-bit floating point, in the case of x86), so performance is identical. On yet others, both float and double have proper hardware support, but because float has fewer bits, it can be done a bit faster, typically reducing the latency a bit relative to double operations.

Disclaimer: all the mentioned timings and characteristics are just pulled from memory. I didn't look any of it up, so it may be wrong. ;)

For different integer types, the answer varies wildly depending on CPU architecture. The x86 architecture, due to its long convoluted history, has to support 8-, 16-, 32- (and today 64-) bit operations natively, and in general, they're all equally fast (they use basically the same hardware, and just zero out the upper bits as needed).

However, on other CPUs, data types smaller than int may be more costly to load/store (writing a byte to memory might have to be done by loading the entire 32-bit word that contains it, masking in the new byte in a register, and then writing the whole word back). Likewise, for data types larger than int, some CPUs may have to split the operation in two, loading/storing/computing the lower and upper halves separately.
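
In C-ish terms, such a byte store expands to something like the following (a minimal sketch, assuming a little-endian machine with only word-sized memory access):

#include <cstdint>

// What "store one byte" can cost on a CPU that only has 32-bit loads and
// stores: a read-modify-write of the containing word (little-endian here).
void store_byte(uint32_t* memory, uint32_t byte_addr, uint8_t value)
{
    uint32_t word  = memory[byte_addr / 4];       // load the whole word
    uint32_t shift = (byte_addr % 4) * 8;         // bit position of the byte
    word &= ~(uint32_t{0xFF} << shift);           // clear the old byte
    word |=  uint32_t{value} << shift;            // insert the new byte
    memory[byte_addr / 4] = word;                 // write the word back
}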

But on x86, the answer is that it mostly doesn't matter. For historical reasons, the CPU is required to have pretty robust support for each and every data type. So the only difference you're likely to notice is that floating-point ops have more latency (but similar throughput, so they're not slower per se, at least if you write your code correctly).

Cheslie answered 21/2, 2011 at 18:29 Comment(0)
12

I don't think anyone has mentioned the integer promotion rules. In standard C/C++, no operation can be performed on a type smaller than int. If char or short happens to be smaller than int on the current platform, it is implicitly promoted to int (which is a major source of bugs). The compiler is required to do this implicit promotion; there's no way around it without violating the standard.

The integer promotions mean that no operation (addition, bitwise, logical, etc.) in the language can occur on a smaller integer type than int. Thus, operations on char/short/int are generally equally fast, as the former ones are promoted to the latter.

And on top of the integer promotions, there's the "usual arithmetic conversions", meaning that C strives to make both operands the same type, converting one of them to the larger of the two, should they be different.
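
A small demonstration of those rules (assuming, as on essentially every platform, that int is wider than 8 bits):

#include <cstdint>
#include <cstdio>

int main()
{
    uint8_t a = 200, b = 100;

    // a and b are promoted to int before the +, so the addition happens
    // in int: no 8-bit wraparound, and the expression's type is int.
    static_assert(sizeof(a + b) == sizeof(int), "operands are promoted");
    std::printf("%d\n", a + b);   // prints 300, not 44

    // Wraparound only happens when the result is squeezed back down:
    uint8_t c = a + b;            // 300 mod 256
    std::printf("%d\n", c);       // prints 44
}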

However, the CPU can perform load/store operations at 8-, 16-, 32-bit etc. granularity. On 8- and 16-bit architectures, this often means that 8- and 16-bit types are faster despite the integer promotions. On a 32-bit CPU it might actually mean that the smaller types are slower, because it wants to have everything neatly aligned in 32-bit chunks. 32-bit compilers typically optimize for speed and allocate smaller integer types in larger space than specified.

Though generally the smaller integer types of course take less space than the larger ones, so if you intend to optimize for RAM size, they are preferable.

Delineator answered 21/2, 2011 at 21:10 Comment(7)
What you say is not really true. While it is true that integers must be promoted according to the standard, that only tells half the story. C has an "as-if" rule, so if you write something like uint8_t c = a + b, logically a and b are promoted, then added, then the upper bits discarded, but the compiler is free to implement it as an 8-bit addition since that will produce the same result.Rubetta
@Rubetta The compiler is only allowed do that optimization if it can ensure that the side effects of the promotion are still there. So if you have uint8_t b=255; and then do uint8_t a = (b + 1)/256; then the result must be 1 not 0. If you have uint8_t b; b << 15 the compiler must invoke undefined behavior in case int is 16 bits. And so on.Delineator
@Delineator what does "the compiler must invoke undefined behavior" mean? The compiler is not obliged to invoke or do anything for code with undefined behaviour :)Iphlgenia
@JonathanWakely It means that if undefined behavior caused by this shift will result in a side effect, the compiler won't "optimize away" that side effect but invoke it. For example if the undefined behavior would result in data changes, the compiler might assume that the programmer is relying on that to happen. As far as the standard is concerned, this is of course out of scope, but compilers are deterministic. "Aha the programmer expects a program crash here. One program crash coming up!"Delineator
@Delineator that's not true at all. Often side effects being optimized away is precisely what happens if they arise from undefined behaviour. If you think you'll always get a crash when you expect one you're in for unpleasant surprises. Undefined behaviour means anything can happen.Iphlgenia
@JonathanWakely Well obviously it is different from case to case. If the undefined behavior of signed integer overflow manifests itself as some two's complement result, it is pretty safe to assume that the same compiler for the same system will deterministically do this every time. Even though the C standard allows it to go haywire. Anyway, these comments have nothing to do with the original topic so I'll stop here.Delineator
@Delineator no, that's really not safe to assume. That's not how modern compilers work. Detecting that overflow happens might depend on optimisation level, whether the function is inlined, the context the function is called in etc. etc. There are many variables involved, and it's not true that the same compiler will do the same thing every time.Iphlgenia
8

The first answer above is great and I copied a small block of it across to the following duplicate (as this is where I ended up first).

Are "char" and "small int" slower than "int"?

I'd like to offer the following code which profiles allocating, initializing and doing some arithmetic on the various integer sizes:

#include <iostream>
#include <cstdint>   // int8_t, int16_t, int32_t, int64_t
#include <cstdio>    // getchar; sprintf_s is the MSVC "secure CRT" variant

#include <windows.h> // QueryPerformanceCounter / QueryPerformanceFrequency

using std::cout; using std::cin; using std::endl;

LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
LARGE_INTEGER Frequency;

void inline showElapsed(const char activity [])
{
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
    cout << activity << " took: " << ElapsedMicroseconds.QuadPart << "us" << endl;
}

int main()
{
    cout << "Hallo!" << endl << endl;

    QueryPerformanceFrequency(&Frequency);

    const int32_t count = 1100100;
    char activity[200];

    //-----------------------------------------------------------------------------------------//
    sprintf_s(activity, "Initialise & Set %d 8 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    int8_t *data8 = new int8_t[count];
    for (int i = 0; i < count; i++)
    {
        data8[i] = i;
    }
    showElapsed(activity);

    sprintf_s(activity, "Add 5 to %d 8 bit integers", count);
    QueryPerformanceCounter(&StartingTime);
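
    // Note: the "Add 5" loops below store i + 5 into each element rather than
    // adding 5 to the value already there, so they really time compute-and-store,
    // just like the initialisation loops above.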

    for (int i = 0; i < count; i++)
    {
        data8[i] = i + 5;
    }
    showElapsed(activity);
    cout << endl;
    //-----------------------------------------------------------------------------------------//

    //-----------------------------------------------------------------------------------------//
    sprintf_s(activity, "Initialise & Set %d 16 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    int16_t *data16 = new int16_t[count];
    for (int i = 0; i < count; i++)
    {
        data16[i] = i;
    }
    showElapsed(activity);

    sprintf_s(activity, "Add 5 to %d 16 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    for (int i = 0; i < count; i++)
    {
        data16[i] = i + 5;
    }
    showElapsed(activity);
    cout << endl;
    //-----------------------------------------------------------------------------------------//

    //-----------------------------------------------------------------------------------------//    
    sprintf_s(activity, "Initialise & Set %d 32 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    int32_t *data32 = new int32_t[count];
    for (int i = 0; i < count; i++)
    {
        data32[i] = i;
    }
    showElapsed(activity);

    sprintf_s(activity, "Add 5 to %d 32 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    for (int i = 0; i < count; i++)
    {
        data32[i] = i + 5;
    }
    showElapsed(activity);
    cout << endl;
    //-----------------------------------------------------------------------------------------//

    //-----------------------------------------------------------------------------------------//
    sprintf_s(activity, "Initialise & Set %d 64 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    int64_t *data64 = new int64_t[count];
    for (int i = 0; i < count; i++)
    {
        data64[i] = i;
    }
    showElapsed(activity);

    sprintf_s(activity, "Add 5 to %d 64 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    for (int i = 0; i < count; i++)
    {
        data64[i] = i + 5;
    }
    showElapsed(activity);
    cout << endl;
    //-----------------------------------------------------------------------------------------//

    getchar();
}



My results in MSVC on i7 4790k:

Initialise & Set 1100100 8 bit integers took: 444us
Add 5 to 1100100 8 bit integers took: 358us

Initialise & Set 1100100 16 bit integers took: 666us
Add 5 to 1100100 16 bit integers took: 359us

Initialise & Set 1100100 32 bit integers took: 870us
Add 5 to 1100100 32 bit integers took: 276us

Initialise & Set 1100100 64 bit integers took: 2201us
Add 5 to 1100100 64 bit integers took: 659us

Ablative answered 3/5, 2016 at 1:2 Comment(0)
2

Is there any performance difference between integer arithmetic and floating-point arithmetic?

Yes. However, this is very much platform and CPU specific. Different platforms can do different arithmetic operations at different speeds.

That being said, the reply in question was a bit more specific. pow() is a general-purpose routine that works on double values. By feeding it integer values, it's still doing all of the work that would be required to handle non-integer exponents. Using direct multiplication bypasses a lot of that complexity, which is where the speed comes into play. This is really not an issue (so much) of different types, but rather of bypassing a large amount of complex code required to make pow work with any exponent.
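
As a sketch of the substitution being described (whether your compiler already performs it for small constant exponents varies, so measure before relying on either form):

#include <cmath>

// General-purpose: must handle any real exponent, so it typically goes
// through log/exp or a similarly complex polynomial path internally.
double cube_pow(double x) { return std::pow(x, 3.0); }

// Special case: two multiplications, no transcendental machinery at all.
double cube_mul(double x) { return x * x * x; }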

Bergquist answered 21/2, 2011 at 18:5 Comment(5)
Please also reply with which is faster, and why?... The speed difference can be guessed, as their representations are different. So the more interesting thing is to know the why.Householder
@Nawaz: It really depends on the platform. A lot has to do with the register size and quantity of your architecture (en.wikipedia.org/wiki/Processor_register) - if your CPU only has 32-bit registers, double math will likely be slow, since it can't be stored in a single register. However, 32-bit integers will likely be very fast. The number and types make a huge difference, but there are many other issues... You see this much more in embedded system work, btw, because this tends to be VERY limited there compared to general-purpose desktop computation...Bergquist
@Nawaz: How deep do you want to dig? The logic circuit to perform most floating-point arithmetic is much more complex than its integer counterpart (of course, you might have a slow integer ALU and a fast FPU in some architecture, so complexity can be overcome with money... sometimes). That is the very low level; at the high level, this answer is quite clear: you need to take fewer things into account. What is easier for you to calculate, x^2 or sqrt(x)? pow(x,0.5) is a square root, and that is more complex than the plain multiplication required for x^2.Indifference
@David: That is a good comment. I think you should post a detailed answer, explaining this from the logic-circuit level up to the sqrt!Householder
@Nawaz: what you need is a book then. SO isn't really suited for novel-sized answers.Cheslie
2

Generally, integer math is faster than floating-point math. This is because integer math involves simpler computations. However, in most operations we're talking about less than a dozen clocks. Not millis, micros, nanos, or ticks; clocks. The ones that happen 2-3 billion times per second in modern cores. Also, since the 486 a lot of cores have floating-point units, or FPUs, which are hard-wired to perform floating-point arithmetic efficiently, and often in parallel with the CPU.

As a result, though floating-point calculations are technically slower, they are still so fast that any attempt to time the difference would have more error inherent in the timing mechanism and thread scheduling than in the calculation itself. Use ints when you can, but understand when you can't, and don't worry too much about relative calculation speed.

Duer answered 21/2, 2011 at 18:16 Comment(1)
-1 Incorrect: "in most operations we're talking about less than a dozen clocks." most modern x86 CPUs can do arithmetic in 1-2 cycles (both integer and float). "since the 486 a lot of cores have a ...FPU" - actually, since the Pentium all x86 CPUs have FP hardware support.Dennett
1

Depends on the composition of the processor and platform.

On platforms that have a floating-point coprocessor, floating-point arithmetic may be slower than integral arithmetic because values have to be transferred to and from the coprocessor.

If floating point processing is within the core of the processor, the execution time may be negligible.

If the floating point calculations are emulated by software, then integral arithmetic will be faster.

When in doubt, profile.

Get the program working correctly and robustly before optimizing.

Benz answered 21/2, 2011 at 18:19 Comment(0)
0

No, not really. This of course depends on the CPU and compiler, but the performance difference is typically negligible - if there even is any.

Niacin answered 21/2, 2011 at 18:9 Comment(1)
Depends on the situation. It is often negligible in everyday application code. But in high-performance numerical code, it can make a major difference. I can name at least one CPU where double addition is literally 14 times slower than int addition, which can definitely be felt in FP-heavy apps ;)Cheslie
0

There is certainly a difference between floating-point and integer arithmetic. Depending on the CPU's specific hardware and micro-instructions, you get different performance and/or precision. Good Google terms for the precise descriptions (I don't know exactly either):

FPU x87 MMX SSE

With regard to the size of the integers, it is best to use the platform/architecture word size (or double that), which comes down to int32_t on x86 and int64_t on x86_64. Some processors have intrinsic instructions that handle several of these values at once (like SSE (floating point) and MMX), which will speed up parallel additions and multiplications.
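
For instance, here is a minimal sketch using SSE2 intrinsics (baseline on x86_64) to add four 32-bit integers in one instruction:

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>
#include <cstdio>

int main()
{
    alignas(16) int32_t a[4] = {1, 2, 3, 4};
    alignas(16) int32_t b[4] = {10, 20, 30, 40};
    alignas(16) int32_t r[4];

    // Load 128 bits (four int32 lanes), add all four lanes in parallel,
    // and store the result back.
    __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(b));
    _mm_store_si128(reinterpret_cast<__m128i*>(r), _mm_add_epi32(va, vb));

    std::printf("%d %d %d %d\n", r[0], r[1], r[2], r[3]);   // 11 22 33 44
}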

Beamy answered 21/2, 2011 at 18:12 Comment(0)
