qsort function compare confused me
Asked Answered
S

5

8

I see lots of people use subtraction in a qsort comparator function. I think it is wrong because when dealing with these numbers: int nums[]={-2147483648,1,2,3}; INT_MIN = -2147483648;

int compare (const void * a, const void * b)
{
  return ( *(int*)a - *(int*)b );
}

I wrote this function to test:

#include <stdio.h>
#include <limits.h>

int compare (const void * a, const void * b)
{
    return ( *(int*)a - *(int*)b );
}

int main(void)
{
    int a = 1;
    int b = INT_MIN;
    printf("%d %d\n", a,b);
    printf("%d\n",compare((void *)&a,(void *)&b));
    return 0;
}

The output is:

1 -2147483648
-2147483647

but a > b so the output should be positive。 I have seen many books write like this. I think it is wrong; it should be written like this when dealing with int types:

int compare (const void * a, const void * b)
{
    if(*(int *)a < *(int *)b)
        return -1;
    else if(*(int *)a > *(int *)b)
        return 1;
    else 
        return 0;
}

I just cannot figure out why many books and web sites write in such a misleading way. If you have any different view, please let me know.

Scop answered 5/4, 2018 at 3:37 Comment(9)
what is the link with qsort(), overflow of signal integer is undefined behavior, what did you expect ? there is a INT_MAX too and 1 + INT_MIN overflow.Facula
i want to know if i was wrong,i think just use - in compare is wrong,what you say should be 1+INT_MAX overflow?Scop
basic math, 1 - (-INT_MIN) == 1 + INT_MINFacula
@Facula you just got wrong,1-INT_MIN = 1+ -INT_MIN = 1 +2147483648 ,since INT_MAX = 2147483647,then overflowScop
you're right, usage of subtraction for comparison is wrong because of overflow, either cast to bigger type (long) or use standard if/elseWonderment
@Scop ???????????????????????????????????????????????????????????????Facula
@Stargateur: 1 + INT_MIN shouldn't overflow. 1 + INT_MAX overflows. INT_MIN - 1 overflows.Barefoot
@JonathanLeffler oh I mess up my equation... shame on me ^^ some how I develop INT_MIN to -2147483648 but I keep INT_MIN for some obscure reasonFacula
"I have seen many books write like this" - the error seems to be pervasive, here's example in gcc code.Raven
O
9

I think it is wrong

Yes, a simple subtraction can lead to int overflow which is undefined behavior and should be avoided.

return *(int*)a - *(int*)b;  // Potential undefined behavior.

A common idiom is to subtract two integer compares. Various compilers recognize this and create efficient well behaved code. Preserving const-ness also is good form.

const int *ca = a;
const int *cb = b;
return (*ca > *cb) - (*ca < *cb);

why many books and web sites write in such a misleading way.

return *a - *b; is conceptually easy to digest - even if it provides the wrong answer with extreme values - often learner code omits edge conditions to get the idea across - "knowing" that values will never be large.

Or consider the complexities of comparing long doubles with regard to NaN.

Owing answered 5/4, 2018 at 4:29 Comment(5)
As an added bonus, subtracting the integer compares gives neat values -1, 0 and 1Blurb
To avoid potential compiler warnings (which should be enabled to detect other problems), you should preserve the constness of the pointer arguments: return (*(const int*)a > *(const int*)b) - (*(const int*)a < *(const int*)b);Lecialecithin
@Lecialecithin Agree about const-ness.Owing
It might be simpler to use: int ca = *(const int *)a; int cb = *(const int *)b; return (ca > cb) - (ca < cb);, losing more of the inderections. I doubt if there is much difference when the optimizing compiler is done with the code, but it looks simpler to me.Barefoot
@JonathanLeffler Agreed. I used const int *ca = a; as I thought it was more step-wise illustrative for OP. It does avoid explicit casting.Owing
L
4

Your understanding is absolutely correct. This common idiom cannot be used for int values.

Your proposed solution works correctly, although it would be more readable with local variables to avoid so many casts:

int compare(const void *a, const void *b) {
    const int *aa = a;
    const int *bb = b;
    if (*aa < *bb)
        return -1;
    else if (*aa > *bb)
        return 1;
    else 
        return 0;
}

Note that modern compilers will generate the same code with or without these local variables: always prefer the more readable form.

A more compact solution with the same exact result is commonly used although a bit more difficult to understand:

int compare(const void *a, const void *b) {
    const int *aa = a;
    const int *bb = b;
    return (*aa > *bb) - (*aa < *bb);
}

Note that this approach works for all numeric types, but will return 0 for NaN floating point values.

As for your remark: I just cannot figure out why many books and web sites write in such a misleading way:

  • Many books and websites contain mistakes, and so do most programs. Many programming bugs get caught and squashed before they reach production if the program is tested wisely. Code fragments in books are not tested, and although they never reach production, the bugs they contain do propagate virally via unsuspecting readers who learn bogus methods and idioms. A very bad and lasting side effect.

  • Kudos to you for catching this! You have a rare skill among programmers: you are a good reader. There are far more programmers who write code than programmers who can read code correctly and see mistakes. Hone this skill by reading other people's code, on stack overflow or from open source projects... And do report the bugs.

  • The subtraction method is in common use, I have seen it in many places like you and it does work for most value pairs. This bug may go unnoticed for eons. A similar problem was latent in the zlib for decades: int m = (a + b) / 2; causes a fateful integer overflow for large int values of a and b.

  • The author probably saw it used and thought the subtraction was cool and fast, worth showing in print.

  • Note however that the erroneous function does work correctly for types smaller than int: signed or unsigned char and short, if these types are indeed smaller than int on the target platform, which the C Standard does not mandate.

  • Indeed similar code can be found in The C Programming Language by Brian Kernighan and Dennis Ritchie, the famous K&R C bible by its inventors. They use this approach in a simplistic implementation of strcmp() in chapter 5. The code in the book is dated, going all the way back to the late seventies. Although it has implementation defined behavior, it does not invoke undefined behavior in any but the rarest architectures among which the infamous DeathStation-9000, yet it should not be used to compare int values.

Lecialecithin answered 5/4, 2018 at 6:44 Comment(0)
K
1

You are correct, *(int*)a - *(int*)b poses a risk of integer overflow and ought to be avoided as a method of comparing two int values.

It is possible it could be valid code in a controlled situation where one knows the values are such that the subtraction will not overflow. In general, though, it should be avoided.

Kara answered 5/4, 2018 at 4:12 Comment(0)
C
1

The reason why so many books are wrong is likely the root of all evil: the K&R book. In chapter 5.5 they try to teach how to implement strcmp:

int strcmp(char *s, char *t)
{
  int i;
  for (i = 0; s[i] == t[i]; i++)
    if (s[i] == '\0')
      return 0;
  return s[i] - t[i];
}

This code is questionable since char has implementation-defined signedness. Ignoring that, and ignoring that they fail to use const correctness as in the standard C version, the code otherwise works, partially because it relies on implicit type promotion to int (which is ugly), partially since they assume 7 bit ASCII, and the worst case 0 - 127 cannot underflow.

Further down in the book, 5.11, they try to teach how to use qsort:

qsort((void**) lineptr, 0, nlines-1,
  (int (*)(void*,void*))(numeric ? numcmp : strcmp));

Ignoring the fact that this code invokes undefined behavior, since strcmp is not compatible with the function pointer int (*)(void*, void*), they teach to use the above method from strcmp.

However, looking at their numcmp function, it looks like this:

/* numcmp: compare s1 and s2 numerically */
int numcmp(char *s1, char *s2)
{
  double v1, v2;
  v1 = atof(s1);
  v2 = atof(s2);
  if (v1 < v2)
    return -1;
  else if (v1 > v2)
    return 1;
  else
    return 0;
}

Ignoring the fact that this code will crash and burn if an invalid character is found by atof (such as the very likely locale issue with . versus ,), they actually manage to teach the correct method of writing such a comparison function. Since this function uses floating point, there's really no other way to write it.

Now someone might want to come up with an int version of this. If they do it based on the strcmp implementation rather than the floating point implementation, they'll get bugs.

Overall, just by flipping a few pages in this once canonical book, we already found some 3-4 cases of reliance on undefined behavior and 1 case of reliance on implementation-defined behavior. So it is really no wonder if people who learn C from this book writes code full of undefined behavior.

Cybil answered 5/4, 2018 at 10:3 Comment(6)
Note that this implementation of strcmp() does not have undefined behavior (unless sizeof(int) == 1 and char is signed, a pathological and rare occurrence). It is correct if char is unsigned by default, and it can be fixed if char is signed by default by changing the last statement to return (unsigned char)s[i] - (unsigned char)t[i];Lecialecithin
@Lecialecithin Actually, the integer promotion makes the type of the return statement int no matter the type. Ok so it is not UB, just badly written code, with reliance on implicit promotions.Cybil
Yes, the return type is fine. Even with the (unsigned char) casts, each value is promoted to int and the difference is computed as int and returned as such. The casts are needed for a conforming implementation of strcmp that is specified in the C Standard as comparing the characters based on their unsigned char values.Lecialecithin
@Lecialecithin Where does the standard say that in strcmp, characters are compared based on unsigned char values? And the cast does not do that, because of the mentioned integer promotion to (signed) int.Cybil
7.24.4 Comparison functions The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared. Furthermore, the casts do exactly what is needed. The char values are converted to unsigned char, then promoted to int.Lecialecithin
@Lecialecithin Alright, I wasn't aware of that restriction. Then indeed the K&R version is non-compliant. The text you cite from the standard is probably as much to blame as K&R though - but then of course the standard could have gotten the idea from K&R initially.Cybil
B
0

First, it's of course correct that an integer during the comparison could create serious problems for you.

On the other hand, doing a single subtraction is cheaper than going through an if/then/else, and the comparison gets performed O(n^2) times in a quicksort, so if this sort is performance-critical and we can get away with it we may want to use the difference.

It will work fine so long as all the values are in some range of size less than 2^31, because then their differences have to be smaller. So if whatever is generating the list you want to sort is going to keep values between a billion and minus one billion then you're fine using subtraction.

Note that checking that the values are in such a range prior to the sort is an O(n) operation.

On the other hand if there's a chance that the overflow could happen, you'd want to use something like the code you wrote in your question

Note that lots of stuff you see doesn't explicitly take overflow into account; it's just that maybe that's more expected in something that's more obviously an "arithmetic" context.

Baggott answered 5/4, 2018 at 4:14 Comment(1)
O(n^2) is worst case and rare. Average is O(n log n).Kara

© 2022 - 2024 — McMap. All rights reserved.