Is the strcasecmp algorithm flawed?

Asked 21/2, 2020 at 16:10 Answered 21/2, 2020 at 16:22

I am trying to reimplement the strcasecmp function in C and I noticed what appears to be an inconsistency in the comparison process.

From man strcmp

The strcmp() function compares the two strings s1 and s2. The locale is not taken into account (for a locale-aware comparison, see strcoll(3)). It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.

From man strcasecmp

The strcasecmp() function performs a byte-by-byte comparison of the strings s1 and s2, ignoring the case of the characters. It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.

int strcmp(const char *s1, const char *s2);
int strcasecmp(const char *s1, const char *s2);

Given, this information, I don't understand the result of the following code:

#include <stdio.h>
#include <string.h>

int main()
{
    // ASCII values
    // 'A' = 65
    // '_' = 95
    // 'a' = 97

    printf("%i\n", strcmp("A", "_"));
    printf("%i\n", strcmp("a", "_"));
    printf("%i\n", strcasecmp("A", "_"));
    printf("%i\n", strcasecmp("a", "_"));
    return 0;
}

Ouput:

-1  # "A" is less than "_"
1   # "a" is more than "_"
2   # "A" is more than "_" with strcasecmp ???
2   # "a" is more than "_" with strcasecmp

It appears that, if the current character in s1 is a letter, it is always converted to lowercase, regardless of whether the current character in s2 is a letter or not.

Can someone explain this behaviour? Shouldn't the first and third lines be identical?

Thank you in advance!

PS:
I am using gcc 9.2.0 on Manjaro.
Also, when I compile with the -fno-builtin flag I get instead:

I guess it's because the program does not use gcc's optimised functions, but the question remains.

Toffic answered 21/2, 2020 at 16:10 Comment(7)

Add another test case to your set: printf("%i\n", strcasecmp("a", "_")); This should presumably have the same result as printf("%i\n", strcasecmp("A", "_")); But that means that one of these two case-insensitive calls is going to disagree with its case-sensitive counterpart. – Intervale 21/2, 2020 at 16:15

It seems the description of strcasecmp you're refering to is not accurate. More details in the upvoted answers. – Costermansville 21/2, 2020 at 16:58

It's the only thing that makes sense. A function that says A < _ && a > _ && A == a would cause so many problems. – Autocrat 21/2, 2020 at 17:25

Aside: "I am trying to reimplement the strcasecmp function in C " --> Although code not shown, be sure to compare "as if"unsigned char. C17/18 "String handling <string.h>" --> "For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char". This makes a difference once char values are outside ASCII range 0-127. – Mcafee 21/2, 2020 at 17:37

@chux-ReinstateMonica Thanks for the tip. I actutally want to implement it in assembly, but I didn't mention it to avoid confusion, as it wasn't necessary. – Toffic 21/2, 2020 at 18:14

writing this in assembly is a bad idea. You won't be able to beat optimized library functions with SIMD or other special techniques – Pontiff 22/2, 2020 at 14:10

On the differences in the outputs with built-ins and without: Both say the same, as their results are identically <0 and >0, and you don't have an example for ==0. But you can see the algorithms shine through: some of the returned values are the differences of the first non-equal character. – Inerrant 22/2, 2020 at 15:38

The behavior is correct.

Per the POSIX str\[n\]casecmp() specification:

When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.

That is also part of the NOTES section of the Linux man page:

The POSIX.1-2008 standard says of these functions:

When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.

Why?

As @HansOlsson pointed out in his answer, doing case-insensitive comparisons between only letters and allowing all other comparisons to have their "natural" results as done in strcmp() would break sorting.

If 'A' == 'a' (the definition of a case-insensitive comparison) then '_' > 'A' and '_' < 'a' (the "natural" results in the ASCII character set) can not both be true.

Sonia answered 21/2, 2020 at 16:19 Comment(9)

Doing case-insensitive comparisons between only letters would not result in '_' > 'A' && '_' < 'a'; doesn't seem like the best example. – Osterman 22/2, 2020 at 14:44

@AsteroidsWithWings Those are the characters used in the question. And if 'a' == 'A' by definition, if you do a comparison between the "natural" values of 'a', 'A', and '_', you can not do a case-insensitive comparison between 'A' and 'a' to get equality and get consistent sort results. – Sonia 22/2, 2020 at 18:54

I'm not disputing that, but the specific counter-example you provided doesn't seem to be relevant. – Osterman 22/2, 2020 at 19:27

@AsteroidsWithWings Go through the mental exercise of building a binary tree from 'a', 'A', and '_', going through all 6 orders of insertion into the tree, and comparing the results from the as-specified "always lowercase letters" to the question's proposed "only convert letters when it's a letter-to-letter comparison". For example, using the latter algorithm and starting with '_', 'a' and 'A' wind up on opposite sides of the tree yet they're defined as equal. The "only convert letters to lower case in letter-letter comparisons" algorithm is broken and those 3 chars show that. – Sonia 23/2, 2020 at 12:49

Okay, then I suggest demonstrating that in the answer because at the moment it just jumps to pointing out that "'_' > 'A' and '_' < 'a' can not both be true" without telling us why we should ever have thought it would be. (That's a task for the answerer, not for one of millions of readers.) – Osterman 23/2, 2020 at 15:7

@AsteroidsWithWings The entire sentence states "If 'A' == 'a' (the definition of a case-insensitive comparison) then '_' > 'A' and '_' < 'a' can not both be true." – Sonia 23/2, 2020 at 15:23

I know what the sentence says... I can read... that did not address my comment at all. Alright, just forget it. Trying to help you improve your answer but I can only lead a horse to water! Have a good one. – Osterman 23/2, 2020 at 17:2

@AsteroidsWithWings I know what the sentence says... Then don't selectively misquote something that no one else has a problem with in a way that implies something clearly stated was left out. No one is preventing you from posting what you think is a better answer. – Sonia 23/2, 2020 at 17:44

I never claimed that that first part of the sentence was left out. I'm saying it's insufficient for the purpose of establishing why the second part is useful/interesting/relevant. Again, I'm not saying it isn't, just that your answer as currently written does not explain that. I'm sorry that you did not like my feedback. Furthermore, what on earth does this have to do with me writing a hypothetical additional answer? Too many non sequiturs. – Osterman 23/2, 2020 at 18:29

Other links, http://man7.org/linux/man-pages/man3/strcasecmp.3p.html for strcasecmp say that converting to lower-case is the correct behavior (at least in POSIX locale).

The reason for that behavior is that if you use strcasecmp to sort an array of strings it is needed to get reasonable results.

Otherwise if you try to sort "A", "C", "_", "b" using e.g., qsort the result would depend on the order of comparisons.

Buddy answered 21/2, 2020 at 16:19 Comment(2)

Otherwise if you try to sort "A", "C", "_", "b" using e.g., qsort the result would depend on the order of comparisons. Good point. That's likely the reason POSIX specifies the behavior. – Sonia 21/2, 2020 at 16:22

More concretely, you need a total order for sorting, which would not be the case if you define the comparison as in the question (since it wouldn't be transitive). – Auden 22/2, 2020 at 17:57

It appears that, if the current character in s1 is a letter, it is always converted to lowercase, regardless of whether the current character in s2 is a letter or not.

That's correct - and it's what the strcasecmp() function should do! It is a POSIX function, rather than part of the C Standard but, from the "The Open Group Base Specifications, Issue 6":

In the POSIX locale, strcasecmp() and strncasecmp() shall behave as if the strings had been converted to lowercase and then a byte comparison performed. The results are unspecified in other locales.

Incidentally, this behaviour is also pertintent to the _stricmp() function (as used in Visual Studio/MSCV):

The _stricmp function ordinally compares string1 and string2 after converting each character to lowercase, and returns a value indicating their relationship.

Mettle answered 21/2, 2020 at 16:22 Comment(0)

The ASCII decimal code for A is 65 for _ is 95 and for a is 97, so strcmp() it's doing what it's suppose to do. Lexicographically speaking _ is smaller then a and bigger than A.

strcasecmp() will regard A as being a*, and since a is bigger than _ the output is also correct.

*The POSIX.1-2008 standard says of these functions (strcasecmp() and strncasecmp()):

When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.

Source: http://man7.org/linux/man-pages/man3/strcasecmp.3.html

Hugohugon answered 21/2, 2020 at 16:17 Comment(3)

OP's point is that A is "bigger" than _ when comparing case-insensitively, and wonders why the result is not the same as when comparing case-sensitively. – Intervale 21/2, 2020 at 16:21

The statement Since strcasecmp()` is case insensitive it will regard A as being a` is an invalid deduction. A case-insensitive routine could treat all uppercase letters as if they were lowercase letters, could treat all lowercase letters as if they were uppercase letters, or could treat each uppercase letter as equal to its corresponding lowercase letter and vice-versa but still compare them to non-letter characters with their raw values. This answer does not state a reason for preferring any of those possibilities (the correct reason for which is the documentation says to use lowercase). – Bs 21/2, 2020 at 17:20

@EricPostpischil The POSIX.1-2008 standard says of these functions (strcasecmp() and strncasecmp()): When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified. – Hugohugon 21/2, 2020 at 17:26

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags