Is the strcasecmp algorithm flawed?
Asked Answered
T

4

34

I am trying to reimplement the strcasecmp function in C and I noticed what appears to be an inconsistency in the comparison process.

From man strcmp

The strcmp() function compares the two strings s1 and s2. The locale is not taken into account (for a locale-aware comparison, see strcoll(3)). It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.

From man strcasecmp

The strcasecmp() function performs a byte-by-byte comparison of the strings s1 and s2, ignoring the case of the characters. It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.

int strcmp(const char *s1, const char *s2);
int strcasecmp(const char *s1, const char *s2);

Given, this information, I don't understand the result of the following code:

#include <stdio.h>
#include <string.h>

int main()
{
    // ASCII values
    // 'A' = 65
    // '_' = 95
    // 'a' = 97

    printf("%i\n", strcmp("A", "_"));
    printf("%i\n", strcmp("a", "_"));
    printf("%i\n", strcasecmp("A", "_"));
    printf("%i\n", strcasecmp("a", "_"));
    return 0;
}

Ouput:

-1  # "A" is less than "_"
1   # "a" is more than "_"
2   # "A" is more than "_" with strcasecmp ???
2   # "a" is more than "_" with strcasecmp

It appears that, if the current character in s1 is a letter, it is always converted to lowercase, regardless of whether the current character in s2 is a letter or not.

Can someone explain this behaviour? Shouldn't the first and third lines be identical?

Thank you in advance!

PS:
I am using gcc 9.2.0 on Manjaro.
Also, when I compile with the -fno-builtin flag I get instead:

-30
2
2
2

I guess it's because the program does not use gcc's optimised functions, but the question remains.

Toffic answered 21/2, 2020 at 16:10 Comment(7)
Add another test case to your set: printf("%i\n", strcasecmp("a", "_")); This should presumably have the same result as printf("%i\n", strcasecmp("A", "_")); But that means that one of these two case-insensitive calls is going to disagree with its case-sensitive counterpart.Intervale
It seems the description of strcasecmp you're refering to is not accurate. More details in the upvoted answers.Costermansville
It's the only thing that makes sense. A function that says A < _ && a > _ && A == a would cause so many problems.Autocrat
Aside: "I am trying to reimplement the strcasecmp function in C " --> Although code not shown, be sure to compare "as if"unsigned char. C17/18 "String handling <string.h>" --> "For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char". This makes a difference once char values are outside ASCII range 0-127.Mcafee
@chux-ReinstateMonica Thanks for the tip. I actutally want to implement it in assembly, but I didn't mention it to avoid confusion, as it wasn't necessary.Toffic
writing this in assembly is a bad idea. You won't be able to beat optimized library functions with SIMD or other special techniquesPontiff
On the differences in the outputs with built-ins and without: Both say the same, as their results are identically <0 and >0, and you don't have an example for ==0. But you can see the algorithms shine through: some of the returned values are the differences of the first non-equal character.Inerrant
S
29

The behavior is correct.

Per the POSIX str\[n\]casecmp() specification:

When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.

That is also part of the NOTES section of the Linux man page:

The POSIX.1-2008 standard says of these functions:

When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.

Why?

As @HansOlsson pointed out in his answer, doing case-insensitive comparisons between only letters and allowing all other comparisons to have their "natural" results as done in strcmp() would break sorting.

If 'A' == 'a' (the definition of a case-insensitive comparison) then '_' > 'A' and '_' < 'a' (the "natural" results in the ASCII character set) can not both be true.

Sonia answered 21/2, 2020 at 16:19 Comment(9)
Doing case-insensitive comparisons between only letters would not result in '_' > 'A' && '_' < 'a'; doesn't seem like the best example.Osterman
@AsteroidsWithWings Those are the characters used in the question. And if 'a' == 'A' by definition, if you do a comparison between the "natural" values of 'a', 'A', and '_', you can not do a case-insensitive comparison between 'A' and 'a' to get equality and get consistent sort results.Sonia
I'm not disputing that, but the specific counter-example you provided doesn't seem to be relevant.Osterman
@AsteroidsWithWings Go through the mental exercise of building a binary tree from 'a', 'A', and '_', going through all 6 orders of insertion into the tree, and comparing the results from the as-specified "always lowercase letters" to the question's proposed "only convert letters when it's a letter-to-letter comparison". For example, using the latter algorithm and starting with '_', 'a' and 'A' wind up on opposite sides of the tree yet they're defined as equal. The "only convert letters to lower case in letter-letter comparisons" algorithm is broken and those 3 chars show that.Sonia
Okay, then I suggest demonstrating that in the answer because at the moment it just jumps to pointing out that "'_' > 'A' and '_' < 'a' can not both be true" without telling us why we should ever have thought it would be. (That's a task for the answerer, not for one of millions of readers.)Osterman
@AsteroidsWithWings The entire sentence states "If 'A' == 'a' (the definition of a case-insensitive comparison) then '_' > 'A' and '_' < 'a' can not both be true."Sonia
I know what the sentence says... I can read... that did not address my comment at all. Alright, just forget it. Trying to help you improve your answer but I can only lead a horse to water! Have a good one.Osterman
@AsteroidsWithWings I know what the sentence says... Then don't selectively misquote something that no one else has a problem with in a way that implies something clearly stated was left out. No one is preventing you from posting what you think is a better answer.Sonia
I never claimed that that first part of the sentence was left out. I'm saying it's insufficient for the purpose of establishing why the second part is useful/interesting/relevant. Again, I'm not saying it isn't, just that your answer as currently written does not explain that. I'm sorry that you did not like my feedback. Furthermore, what on earth does this have to do with me writing a hypothetical additional answer? Too many non sequiturs.Osterman
B
22

Other links, http://man7.org/linux/man-pages/man3/strcasecmp.3p.html for strcasecmp say that converting to lower-case is the correct behavior (at least in POSIX locale).

The reason for that behavior is that if you use strcasecmp to sort an array of strings it is needed to get reasonable results.

Otherwise if you try to sort "A", "C", "_", "b" using e.g., qsort the result would depend on the order of comparisons.

Buddy answered 21/2, 2020 at 16:19 Comment(2)
Otherwise if you try to sort "A", "C", "_", "b" using e.g., qsort the result would depend on the order of comparisons. Good point. That's likely the reason POSIX specifies the behavior.Sonia
More concretely, you need a total order for sorting, which would not be the case if you define the comparison as in the question (since it wouldn't be transitive).Auden
M
8

It appears that, if the current character in s1 is a letter, it is always converted to lowercase, regardless of whether the current character in s2 is a letter or not.

That's correct - and it's what the strcasecmp() function should do! It is a POSIX function, rather than part of the C Standard but, from the "The Open Group Base Specifications, Issue 6":

In the POSIX locale, strcasecmp() and strncasecmp() shall behave as if the strings had been converted to lowercase and then a byte comparison performed. The results are unspecified in other locales.

Incidentally, this behaviour is also pertintent to the _stricmp() function (as used in Visual Studio/MSCV):

The _stricmp function ordinally compares string1 and string2 after converting each character to lowercase, and returns a value indicating their relationship.

Mettle answered 21/2, 2020 at 16:22 Comment(0)
H
2

The ASCII decimal code for A is 65 for _ is 95 and for a is 97, so strcmp() it's doing what it's suppose to do. Lexicographically speaking _ is smaller then a and bigger than A.

strcasecmp() will regard A as being a*, and since a is bigger than _ the output is also correct.

*The POSIX.1-2008 standard says of these functions (strcasecmp() and strncasecmp()):

When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.

Source: http://man7.org/linux/man-pages/man3/strcasecmp.3.html

Hugohugon answered 21/2, 2020 at 16:17 Comment(3)
OP's point is that A is "bigger" than _ when comparing case-insensitively, and wonders why the result is not the same as when comparing case-sensitively.Intervale
The statement Since strcasecmp()` is case insensitive it will regard A as being a` is an invalid deduction. A case-insensitive routine could treat all uppercase letters as if they were lowercase letters, could treat all lowercase letters as if they were uppercase letters, or could treat each uppercase letter as equal to its corresponding lowercase letter and vice-versa but still compare them to non-letter characters with their raw values. This answer does not state a reason for preferring any of those possibilities (the correct reason for which is the documentation says to use lowercase).Bs
@EricPostpischil The POSIX.1-2008 standard says of these functions (strcasecmp() and strncasecmp()): When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.Hugohugon

© 2022 - 2024 — McMap. All rights reserved.