In C, why is "signed int" faster than "unsigned int"?
Asked Answered
M

6

42

In C, why is signed int faster than unsigned int? True, I know that this has been asked and answered multiple times on this website (links below). However, most people said that there is no difference. I have written code and accidentally found a significant performance difference.

Why would the "unsigned" version of my code be slower than the "signed" version (even when testing the same number)? (I have a x86-64 Intel processor).

Similar links

Compile Command: gcc -Wall -Wextra -pedantic -O3 -Wl,-O3 -g0 -ggdb0 -s -fwhole-program -funroll-loops -pthread -pipe -ffunction-sections -fdata-sections -std=c11 -o ./test ./test.c && strip --strip-all --strip-unneeded --remove-section=.note --remove-section=.comment ./test


signed int version

NOTE: There is no difference if I explicitly declare signed int on all numbers.

int isprime(int num) {
    // Test if a signed int is prime
    int i;
    if (num % 2 == 0 || num % 3 == 0)
        return 0;
    else if (num % 5 == 0 || num % 7 == 0)
        return 0;
    else {
        for (i = 11; i < num; i += 2) {
            if (num % i == 0) {
                if (i != num)
                    return 0;
                else
                    return 1;
            }
        }
    }
    return 1;
}

unsigned int version

int isunsignedprime(unsigned int num) {
    // Test if an unsigned int is prime
    unsigned int i;
    if (num % (unsigned int)2 == (unsigned int)0 || num % (unsigned int)3 == (unsigned int)0)
        return 0;
    else if (num % (unsigned int)5 == (unsigned int)0 || num % (unsigned int)7 == (unsigned int)0)
        return 0;
    else {
        for (i = (unsigned int)11; i < num; i += (unsigned int)2) {
            if (num % i == (unsigned int)0) {
                if (i != num)
                    return 0;
                else
                    return 1;
            }
        }
    }
    return 1;
}

Test this in a file with the below code:

int main(void) {
    printf("%d\n", isprime(294967291));
    printf("%d\n", isprime(294367293));
    printf("%d\n", isprime(294967293));
    printf("%d\n", isprime(294967241)); // slow
    printf("%d\n", isprime(294967251));
    printf("%d\n", isprime(294965291));
    printf("%d\n", isprime(294966291));
    printf("%d\n", isprime(294963293));
    printf("%d\n", isprime(294927293));
    printf("%d\n", isprime(294961293));
    printf("%d\n", isprime(294917293));
    printf("%d\n", isprime(294167293));
    printf("%d\n", isprime(294267293));
    printf("%d\n", isprime(294367293)); // slow
    printf("%d\n", isprime(294467293));
    return 0;
}

Results (time ./test):

Signed - real 0m0.949s
Unsigned - real 0m1.174s
Menopause answered 8/12, 2015 at 20:9 Comment(25)
It may just be due to the overhead of all of the explicit casting.Marthena
did you tried to explicit cast to signed also ?Eerie
Have a look #4712815Naresh
This is probably processor specific, no?Desberg
Good points, everyone. Let me try explicitly casting "signed".Menopause
The first thing I'd do is compare the generated assembly code for both cases, and see if any additional instructions are being emitted in the unsigned case.Script
Also, are you compiling with -O enabled?Script
When writing benchmarks I strongly suggest to increase the measurement time to at least 30s and to run the task executing the benchmark with highest priority possible. Otherwise your measured 20% difference may be cause by a great deal by the OS.Hyson
I updated the question with additional information. Explicitly declaring "signed int" made no difference. @TomKarzes , my compiler flags were posted prior to any edits.Menopause
@Cᴏʀʏ they're free casts anyway. They tell the compiler to change the type, but cost no code to implement.Jamajamaal
Aside from the systematic error pointed out by @LukasThomsen, there is no evidence of a statistically significant difference.Turf
Both assembly code only differ by (un)signed arithmetic (idivl vs divl) on my machine, but execution time is roughly the same OP showed. This difference is then due to these instructions... Is there any border effect due to instruction parallelism, out-of-order execution, etc?Bifilar
What's the int size? If it's 32 bits, there could be some overhead for sign-extending 32-bit constants to 64-bit registers. But taking a look at the generated assembly would reveal that. Take a debugger and see.Sextan
@LukasThomsen and Rhymoid , I tested the code with more numbers (all unique), and "unsigned int" is still slower than "signed int". I added enough numbers to see a 50% difference. You had a good idea, though.Menopause
For example on Haswell, 32bit div (unsigned) takes one more µop (10 vs 9) and has a slightly worse (on average) throughput compared to 32bit idiv (signed), but that seems like too small a difference to explain thisJamajamaal
No this can explain it, the observed difference is not so huge and as I said the code only differ (for me) on signedness. What is strange is that when I changed int to long then results are the converse! It is much longer for signed than for unsigned (almost the same proportion than the original test).Bifilar
@Jean-BaptisteYunès usually the 64bit idiv is a lot worse than the 64bit div. But the difference in time here is over 20% right, while the difference in throughput between 32bit idiv and div is nowhere near thatJamajamaal
(unsigned int)3 can be more clearly written 3uSafko
To solve this you should diff the generated assembly code.Safko
By the way, there is no cost in writing 2U which is unsigned literal, making (unsigned) cast unnecessary (and more readable code).Singlebreasted
(signed)2 is exactly the same as 2 because integer literals must be int unless it doesn't fit in an int. Same to (unsigned)2 and 2UBrazier
Your code is wrong because it'll return false for 2, 3, 5, 7. And it's extremely inefficient because it checks for all odd divisors up to num whereas this at least could be optimized down to only factors in the form 6k±1 of num up until sqrt(num) which significantly reduces the number of loops neededBrazier
A bit of pedantry: if you write int x, you are not explicitly declaring a signed int. You are declaring it implicitly. To be explicit, you would write signed int x. To explicitly declare x signed, and implicitly declare it an integer, you would write signed x.Matthus
Any performance difference is due to the compiler or test method and not due to C In C, why is “signed int” faster than “unsigned int”? is not the true question here.Herald
Have you tried running your test with the -fwrapv option?Koger
P
39

Your question is genuinely intriguing as the unsigned version consistently produces code that is 10 to 20% slower. Yet there are multiple problems in the code:

  • Both functions return 0 for 2, 3, 5 and 7, which is incorrect.
  • The test if (i != num) return 0; else return 1; is completely useless as the loop body is only run for i < num. Such a test would be useful for the small prime tests but special casing them is not really useful.
  • the casts in the unsigned version are redundant.
  • benchmarking code that produces textual output to the terminal is unreliable, you should use the clock() function to time CPU intensive functions without any intervening I/O.
  • the algorithm for prime testing is utterly inefficient as the loop runs num / 2 times instead of sqrt(num).

Let's simplify the code and run some precise benchmarks:

#include <stdio.h>
#include <time.h>

int isprime_slow(int num) {
    if (num % 2 == 0)
        return num == 2;
    for (int i = 3; i < num; i += 2) {
        if (num % i == 0)
            return 0;
    }
    return 1;
}

int unsigned_isprime_slow(unsigned int num) {
    if (num % 2 == 0)
        return num == 2;
    for (unsigned int i = 3; i < num; i += 2) {
        if (num % i == 0)
            return 0;
    }
    return 1;
}

int isprime_fast(int num) {
    if (num % 2 == 0)
        return num == 2;
    for (int i = 3; i * i <= num; i += 2) {
        if (num % i == 0)
            return 0;
    }
    return 1;
}

int unsigned_isprime_fast(unsigned int num) {
    if (num % 2 == 0)
        return num == 2;
    for (unsigned int i = 3; i * i <= num; i += 2) {
        if (num % i == 0)
            return 0;
    }
    return 1;
}

int main(void) {
    int a[] = {
        294967291, 0, 294367293, 0, 294967293, 0, 294967241, 1, 294967251, 0,
        294965291, 0, 294966291, 0, 294963293, 0, 294927293, 1, 294961293, 0,
        294917293, 0, 294167293, 0, 294267293, 0, 294367293, 0, 294467293, 0,
    };
    struct testcase { int (*fun)(); const char *name; int t; } test[] = {
        { isprime_slow, "isprime_slow", 0 },
        { unsigned_isprime_slow, "unsigned_isprime_slow", 0 },
        { isprime_fast, "isprime_fast", 0 },
        { unsigned_isprime_fast, "unsigned_isprime_fast", 0 },
    };

    for (int n = 0; n < 4; n++) {
        clock_t t = clock();
        for (int i = 0; i < 30; i += 2) {
            if (test[n].fun(a[i]) != a[i + 1]) {
                printf("%s(%d) != %d\n", test[n].name, a[i], a[i + 1]);
            }
        }
        test[n].t = clock() - t;
    }
    // Using CLOCKS_PER_SEC to convert CPU time to milliseconds:
    // on Windows CLOCKS_PER_SEC is 1000,
    // on Linux and macOS, CLOCKS_PER_SEC is usually 1000000,
    // printf("CLOCKS_PER_SEC: %ld\n", (long)CLOCKS_PER_SEC);
    for (int n = 0; n < 4; n++) {
        double ms = test[n].t * 1000.0 / CLOCKS_PER_SEC;
        printf("%21s: %.3fms\n", test[n].name, ms);
    }
    return 0;
}

The code compiled with clang -O2 on OS/X produces this output:

         isprime_slow:  788.004ms
unsigned_isprime_slow:  965.381ms
         isprime_fast:    0.065ms
unsigned_isprime_fast:    0.089ms

These timings are consistent with the OP's observed behavior on a different system, but show the dramatic improvement caused by the more efficient iteration test: 10000 times faster!

Regarding the question Why is the function slower with unsigned?, let's look at the generated code (gcc 7.2 -O2):

isprime_slow(int):
        ...
.L5:
        movl    %edi, %eax
        cltd
        idivl   %ecx
        testl   %edx, %edx
        je      .L1
.L4:
        addl    $2, %ecx
        cmpl    %esi, %ecx
        jne     .L5
.L6:
        movl    $1, %edx
.L1:
        movl    %edx, %eax
        ret

unsigned_isprime_slow(unsigned int):
        ...
.L19:
        xorl    %edx, %edx
        movl    %edi, %eax
        divl    %ecx
        testl   %edx, %edx
        je      .L22
.L18:
        addl    $2, %ecx
        cmpl    %esi, %ecx
        jne     .L19
.L20:
        movl    $1, %eax
        ret
       ...
.L22:
        xorl    %eax, %eax
        ret

The inner loops are very similar, same number of instructions, similar instructions. Here are however some potential explanations:

  • cltd extends the sign of the eax register into the edx register, which may be causing an instruction delay because eax is modified by the immediately preceeding instruction movl %edi, %eax. Yet this would make the signed version slower than the unsigned one, not faster.
  • the loops' initial instructions might be misaligned for the unsigned version, but it is unlikely as changing the order in the source code has no effect on the timings.
  • Although the register contents are identical for the signed and unsigned division opcodes, it is possible that the idivl instruction take fewer cycles than the divl instruction. Indeed the signed division operates on one less bit of precision than the unsigned division, but the difference seems quite large for this small change.
  • I suspect more effort was put into the silicon implementation of idivl because signed divisions are more common that unsigned divisions (as measured by years of coding statistics at Intel).
  • as commented by rcgldr, looking at instruction tables for Intel process, for Ivy Bridge, DIV 32 bit takes 10 micro ops, 19 to 27 cycles, IDIV 9 micro ops, 19 to 26 cycles. The benchmark times are consistent with these timings. The extra micro-op may be due to the longer operands in DIV (64/32 bits) as opposed to IDIV (63/31 bits).

This surprising result should teach us a few lessons:

  • optimizing is a difficult art, be humble and procrastinate.
  • correctness is often broken by optimizations.
  • choosing a better algorithm beats optimization by a long shot.
  • always benchmark code, do not trust your instincts.
Puglia answered 31/8, 2017 at 22:3 Comment(6)
Why not precompute sqrt(m) rounded down (float cast to unsigned int, say, once and use that as the upper limit in the iteration? That's faster than testing i * i <= mFideicommissum
@HennoBrandsma: it may be faster or not faster, only benchmarking can tell... Multiplication is quite fast on modern CPUs. There is also a precision issue for large values of num. Last and not least, I was trying to have very simple functions so the assembly code would stay simple too.Puglia
This answer seems to have the most actual information, but it still doesn't clear things up for me.Viipuri
@MichaelBurr: I suspect more effort was put into the silicon implementation of idivl because signed divisions are more common that unsigned divisions (as measured by years of coding statistics at Intel), but I have no proof.Puglia
Looking at instruction tables for Intel process, for Ivy Bridge, DIV 32 bit takes 10 micro ops, 19 to 27 cycles, IDIV 9 micro ops, 19 to 26 cycles. On WIndows XP (my only 32 bit OS), Intel 3770K 3.5 ghz, Visual Studio, the fast times are 0.048 ms for int, 0.065 ms for unsigned int.Reproach
@GunnarBernstein: thank you for the edit, portability is a must. Note that CLOCKS_PER_SEC has type clock_t, which is not necessarily compatible with long.Puglia
K
16

Because signed integer overflow is undefined, the compiler can make a lot of assumptions and optimizations on code involving signed integers. Unsigned integer overflow is defined to wrap around, so the compiler won't be able to optimize as much. See also http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html#signed_overflow and http://www.airs.com/blog/archives/120.

Koger answered 31/8, 2017 at 13:1 Comment(2)
Your statements are correct, but looking at the code generated by x86 compilers, it does not seem to be a pertinent explanation.Puglia
Your statements looks right but it seems to be answering the reverse question. So wrong at last.Chimerical
C
3

From Instruction specification on AMD/Intel we have (for K7):

Instruction Ops Latency Throughput
DIV r32/m32 32  24      23
IDIV r32    81  41      41
IDIV m32    89  41      41 

For i7, latency and throughput are the same for IDIVL and DIVL, a slight difference exists for the µops.

This may explain the difference as -O3 assembly codes only differ by signedness (DIVL vs IDIVL) on my machine.

Cerelia answered 8/12, 2015 at 20:57 Comment(4)
Hold on, that's the wrong way around: this would make unsigned fasterJamajamaal
But this table says it should be faster for 32bit unsigned, which it wasn't (but of course OP probably doesn't have a K7 anymore)Jamajamaal
These may be worst case latency, for something like negative operands. Does your measured performance change when operands are negative?Atrocious
the unsigned version uses div which will have to be fasterBrazier
B
0

I took the liberty to run the code provided by chqrlie on my Intel Coffee Lake i7-i8700k CPU.

in C: Results running on Windows (clock_t measures in ms),
Compiled with Clang with Visual Studio Code:
         isprime_slow: 652.000ms
unsigned_isprime_slow: 645.000ms
         isprime_fast: 0.000ms
unsigned_isprime_fast: 0.000ms

in C: Results running on Windows (clock_t measures in ms),
Compiled with Intel API with Visual Studio:
         isprime_slow: 534.000ms
unsigned_isprime_slow: 536.000ms
         isprime_fast: 0.000ms
unsigned_isprime_fast: 0.000ms

Result: unsigned is comparable fast.

in C: Results running on WSL (Ubuntu) under Windows,
Compiled with Clang with Visual Studio Code:
         isprime_slow: 538.425ms
unsigned_isprime_slow: 600.475ms
         isprime_fast: 0.041ms
unsigned_isprime_fast: 0.037ms

Result: unsigned is slower on Linux with Clang. This seems to be an issue with the Clang compiler as the signed code has comparable speed to windows compiler. Interestingly, the fast solution is even faster then the signed version, even with Clang.

To check, if the compiler is the issue, I have rewritten the Code in Rust. In Rust: Results running on Windows, rewritten in Rust: isprime_slow: 530.434ms unsigned_isprime_slow: 530.490ms isprime_fast: 0.045ms unsigned_isprime_fast: 0.038ms

In Rust: Results running on WSL, rewritten in Rust:
         isprime_slow: 523.833ms
unsigned_isprime_slow: 519.132ms
         isprime_fast: 0.040ms
unsigned_isprime_fast: 0.040ms

Now using a different compiler with an equally fast language the results match the Intel Compiler.

As such: The speed for unsigned and signed only has to do on how it was compiled.

Rust Code

use std::time::Instant;

struct Testcase {
    f: &'static dyn Fn(i32) -> bool,
    name: &'static str,
    duration_ms: f64,
}

impl Testcase {
    pub fn do_test(&self, number: i32) -> bool {
        (self.f)(number)
    }
}

fn is_prime_signed_slow_i32(number: i32) -> bool {
    let num = number;
    if num % 2 == 0 {
        return num == 2; // returns true for 2 which is a prime, else false
    }
    for i in (3..num).step_by(2) {
        if num % i == 0 {
            return false;
        }
    }
    return true;
}

fn is_prime_signed_fast_while_i32(number: i32) -> bool {
    let num = number;
    if num % 2 == 0 {
        return num == 2;
    }
    let mut i: i32 = 3;
    while i * i <= num {
        if num % i == 0 {
            return false;
        }
        i += 2;
    }
    return true;
}

fn is_prime_unsigned_slow_u32(number: i32) -> bool {
    //println!("slow");
    let num = number as u32;
    if num % 2 == 0 {
        return num == 2;
    }
    for i in (3..num).step_by(2) {
        if num % i == 0 {
            return false;
        }
    }
    return true;
}

fn is_prime_unsigned_fast_while_u32(number: i32) -> bool {
    // println!("while");
    let num = number as u32;
    if num % 2 == 0 {
        return num == 2;
    }
    let mut i: u32 = 3;
    while i * i <= num {
        if num % i == 0 {
            return false;
        }
        i += 2;
    }
    return true;
}

fn main() {
    println!("Testing Prime Number Check\n");

    let mut testcases = [
        Testcase {
            f: &is_prime_signed_slow_i32,
            name: "isprime_slow",
            duration_ms: 0.0,
        },
        Testcase {
            f: &is_prime_unsigned_slow_u32,
            name: "unsigned_isprime_slow",
            duration_ms: 0.0,
        },
        Testcase {
            f: &is_prime_signed_fast_while_i32,
            name: "isprime_fast",
            duration_ms: 0.0,
        },
        Testcase {
            f: &is_prime_unsigned_fast_while_u32,
            name: "unsigned_isprime_fast",
            duration_ms: 0.0,
        },
    ];

    let primes: [i32; 30] = [
        294967291, 0, 294367293, 0, 294967293, 0, 294967241, 1, 294967251, 0, 294965291, 0,
        294966291, 0, 294963293, 0, 294927293, 1, 294961293, 0, 294917293, 0, 294167293, 0,
        294267293, 0, 294367293, 0, 294467293, 0,
    ];

    // get max len testcase for formatting
    let mut max_len = 0;
    for t in &testcases {
        let len = t.name.len();
        if len > max_len {
            max_len = len;
        }
    }

    // run testcases
    for testcase in testcases.iter_mut() {
        let start = Instant::now();
        for i in (0..primes.len()).step_by(2) {
            let is_prime = testcase.do_test(primes[i]);
            if is_prime as i32 != primes[i + 1] {
                println!("Error in {} for prime id {i}: {}", testcase.name, primes[i]);
            }
        }
        testcase.duration_ms = start.elapsed().as_micros() as f64 / 1000.0;
        println!("{:>max_len$} = {:.3}ms", testcase.name, testcase.duration_ms);
    }

}
Breskin answered 14/1 at 8:1 Comment(2)
This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - From ReviewPaling
To the people deleting this answer. Please explain your reasoning. The original question asks why the code for unsigned is slower. This answers it. It is the compiler.Breskin
H
-1

Alternative wiki candidate test that may/may not show a significant time difference.

#include <stdio.h>
#include <time.h>

#define J 10
#define I 5

int main(void) {
  clock_t c1,c2,c3;
  for (int j=0; j<J; j++) {
    c1 = clock();
    for (int i=0; i<I; i++) {
      isprime(294967241);
      isprime(294367293);
    }
    c2 = clock();
    for (int i=0; i<I; i++) {
      isunsignedprime(294967241);
      isunsignedprime(294367293);
    }
    c3 = clock();
    printf("%d %d %d\n", (int)(c2-c1), (int)(c3-c2), (int)((c3-c2) - (c2-c1)));
    fflush(stdout);
  }
  return 0;
}

Sample output

2761 2746 -15
2777 2777 0
2761 2745 -16
2793 2808 15
2792 2730 -62
2746 2730 -16
2746 2730 -16
2776 2793 17
2823 2808 -15
2793 2823 30
Herald answered 8/12, 2015 at 20:9 Comment(1)
Thank you for contributing to the Stack Overflow community. This may be a correct answer, but it’d be really useful to provide additional explanation of your code so developers can understand your reasoning. This is especially useful for new developers who aren’t as familiar with the syntax or struggling to understand the concepts. Would you kindly edit your answer to include additional details for the benefit of the community?Thinner
A
-1

In fact in many cases unsigned is faster than signed for eample

  • In dividing by a power of 2
    unsigned int x=37;
    cout<<x/4;
  • In checking if a number if even
    unsigned int x=37;
    cout<<(x%2==0)?"even":"odd";
Aleron answered 18/9, 2022 at 15:53 Comment(3)
"in many cases unsigned is faster than signed" do you have evidence for this claim? Your examples only show how to use unsigned ints, but your answer does not show anything to do with performance and thus does not really answer the question.Shornick
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Celestial
@ThomasFlinkow For bechmarking check this answerAleron

© 2022 - 2024 — McMap. All rights reserved.