Can't get strcoll() to use locales when sorting in C

I have not been able to get locale-dependent functions such as strcoll() to work in C. I am wondering whether I am doing something wrong and/or how to get this to work. Here is a sample program from this book: Prinz, Peter, and Tony Crawford. 2016. C in a Nutshell, 2nd edn., p. 574. Beijing-Boston-Farnham-Sebastopol-Tokyo: O'Reilly. ISBN-13: 978-1-491-90475-6.

#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(void) {
   char *samples[ ] = { "curso", "churro" };
   setlocale(LC_COLLATE, "es_ES.UTF-8");
   int result = strcoll(samples[0], samples[1]);
   if(result == 0) {
      printf("The strings \"%s\" and \"%s\" are "
             "alphabetically equivalent.\n",
             samples[0], samples[1]);
   } else if(result < 0) {
      printf("The string \"%s\" comes before \"%s\" "
             "alphabetically.\n",
             samples[0], samples[1]);
   } else if(result > 0) {
      printf("The string \"%s\" comes after \"%s\" "
             "alphabetically.\n",
             samples[0], samples[1]);
   }
   return(0);
}

The book says that "curso" should come BEFORE "churro", because in Spanish "ch" is considered a separate letter for purposes of alphabetization. However, when I run this program it prints that "curso" comes AFTER "churro". I do not know Spanish, but I have tested this program with several other languages that I do know, and the result is always that of strcmp(), a strictly numerical comparison.

$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
$ locale -a | grep es_ES.utf8             
es_ES.utf8

I am aware of this question: "Getting locale functions to work in glibc". Its author says that locale-dependent functions such as strcoll() perform poorly in glibc, and that he was writing his own modifications of it.

Am I missing something? Does this simply not work?

Janiculum answered 28/12, 2022 at 15:49 Comment(9)
Does this work in C which has no idea what UTF-8 is? - Roesch
That is a good question. The book says "The value of the locale information category LC_COLLATE determines the applicable rule set, and can be changed by the setlocale() function." The example uses a UTF-8 locale, and I have written many programs in C that can handle UTF-8, so I assumed that strcoll() was also written in a way that could do this. - Janiculum
This piece of code works as expected on macOS, but not on Ubuntu. On my Mac it says: The string "curso" comes before "churro" alphabetically. On Ubuntu it says: The string "curso" comes after "churro" alphabetically. I made sure to have es_ES.UTF-8 installed on both systems. - Sheugh
Check the return value of setlocale(). If it returns NULL, it means that "es_ES.UTF-8" was not honored and the locale is left unchanged. - Kaplan
Your book has outdated information. The digraph ch has not been considered a single letter since 1994. See rae.es/dpd/abecedario. "en el X Congreso de la Asociación de Academias de la Lengua Española, celebrado en 1994, se acordó adoptar el orden alfabético latino universal, en el que la ch y la ll no se consideran letras independientes." - Snowmobile
You can also look at the Unicode collation data here. As you can see, there are several collation orders. The standard one does not consider ch and ll special, while the traditional one does. Linux/glibc implements the standard collation. You can check that your locale collation is working with accented characters. - Snowmobile
@ryykker: I did check that by simply printing out its value, but I omitted that from the code that I posted. That's not the problem. - Janiculum
@n.m.: Thank you for the information about Spanish. I do not know Spanish, so that is good to know. The example that I quoted from the book used Spanish, and I simply copied it. However, I did check the result using other languages that I do know, and the problem is the same. - Janiculum
There is no problem with Spanish. The program outputs what is expected. If you see a problem with some other language, please show it. - Snowmobile

Your book has outdated information. The Spanish digraph ch has not been considered a single letter since 1994. See https://rae.es/dpd/abecedario.

"en el X Congreso de la Asociación de Academias de la Lengua Española, celebrado en 1994, se acordó adoptar el orden alfabético latino universal, en el que la ch y la ll no se consideran letras independientes."

(Hope no translation is needed)

You can also look at the Unicode collation data here. This is the source glibc derives its collation data from. As you can see, there are several collation orders. The standard one does not consider ch and ll special, while the traditional one does. Glibc implements the standard collation.

You can check that your Spanish locale collation is working by trying strings with accented characters. If the system is working, those should sort in the order described by the collation rules (i.e. right after the corresponding non-accented character). If it is not (i.e. if you forget to call setlocale or the locale is not supported), they sort after all non-accented letters. Demo. Note that on godbolt GCC does not support locales, while MSVC does (and with the Unix-like locale names to boot).

If you want to test multi-character collation, use the Czech locale (cs_CZ.UTF-8), which does recognise ch as a single letter that comes after h in the collation order. Demo.

Snowmobile answered 28/12, 2022 at 17:19 Comment(3)
I needed it translated :) : "At the X Congress of the Association of Spanish Language Academies, held in 1994, it was agreed to adopt the universal Latin alphabetical order, in which ch and ll are not considered independent letters." - Kaplan
Hello again @n.m. I think that you have identified the problem, namely that there are multiple possible sort orders. Another thing that makes the question of collation very confusing is that there can be different collation levels. I won't try to explain this here, but this webpage explains it: docs.oracle.com/database/121/NLSPG/ch5lingsort.htm#NLSPG271 I.e., even if one character "comes after" another, they might be considered identical at a certain level of sorting, when the next character is considered. This makes sorting French very difficult. Thank you and everyone who posted! - Janiculum
@ThomasHedden Different levels of collation are a concern for people who implement collation. For end users they are normally not a problem. - Snowmobile

"Am I missing something? Does this simply not work?"

I think it comes down to whether or not your environment recognizes "es_ES.UTF-8".

Note, I do not have access to a Linux environment, which waters down the ability to compare apples with apples. But I hope the following highlights a few things that might help...

On Windows, and using a standard LabWindows/CVI compiler (my version is based on Clang 3.3) it outputs the following:

"The string "curso" comes after "churro" alphabetically."

which appears to be incorrect according to your stated expectations when using the Spanish alphabetization rules.

I suspect implementation and version of libraries contribute to what we are seeing.

Note that when I later checked the return value of setlocale:

char *new = setlocale(LC_COLLATE, "es_ES.UTF-8");

It came back NULL. The documentation describes the return value as follows:

"If locale is non-NULL and can be honored, a pointer to the string associated with the specified category is returned. If the All Categories setting is selected, then the strings contain a concatenation of the locales for the different categories. If the selection cannot be honored, the function returns a NULL pointer and the program's locale remains unchanged."

So "es_ES.UTF-8" was not honored, and the locale was left unchanged.
This article has some interesting and related insights into using UTF-8 in C. (...and how it relates to the locale problems seen here.)

Kaplan answered 28/12, 2022 at 16:14 Comment(3)
Windows used to have (perhaps still has) a very different set of locales from Linux. "es_ES.UTF-8" is a standard locale that should be present on any out-of-the-box installation of Ubuntu. Not so on Windows. - Snowmobile
MSVC on godbolt has the locale, and it is said to run on some kind of Windows server. Their installation of gcc does not, but then they probably don't run an out-of-the-box desktop Linux OS. - Snowmobile
@ryyker: Thank you for that hint, but I did check that. I added all the locales that I tested using # locale-gen <locale name> . In this case, that is not the problem. - Janiculum
