Why does wprintf transliterate Russian text in Unicode into Latin on Linux?
M

2

36

Why does the following program

#include <stdio.h>
#include <wchar.h>

int main() {
  wprintf(L"Привет, мир!");
}

print "Privet, mir!" on Linux? Specifically, why does it transliterate Russian text in Unicode into Latin as opposed to transcoding it into UTF-8 or using replacement characters?

Demonstration of this behavior on Godbolt: https://godbolt.org/z/36zEcG

The non-wide version printf("Привет, мир!") prints this text as expected ("Привет, мир!").

Misunderstanding answered 29/12, 2020 at 15:17 Comment(4)
Out of curiosity, why even use wchar on Linux? - Brownlee
There is no reason to use wchar_t since it's non-portable. I just came across this "interesting" behavior when answering another SO question: https://mcmap.net/q/223060/-how-i-can-print-the-wchar_t-values-to-console - Misunderstanding
On my system, it just prints ??????, ???!. Could you check /usr/share/i18n/locales/C and see if there are any rules starting with translit in there? - Chryselephantine
@Heinzi, you can check locales on godbolt if interested - there is a link in the question. - Misunderstanding
K
33

Because conversion of wide characters is done according to the currently set locale. By default, every C program starts in the "C" locale, which on glibc covers only ASCII characters.

You have to switch to any Russian or UTF-8 locale first:

setlocale(LC_ALL, "ru_RU.utf8"); // Russian Unicode
setlocale(LC_ALL, "en_US.utf8"); // English US Unicode

Or to the current system locale (which is likely what you need):

setlocale(LC_ALL, "");

The full program will be:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
  setlocale(LC_ALL, "ru_RU.utf8");
  wprintf(L"Привет, мир!\n");
}

As for your code working as-is on other machines: this is due to how libc operates there. Some implementations (like musl) do not support non-Unicode locales and thus can unconditionally translate wide characters to a UTF-8 sequence.

K2 answered 29/12, 2020 at 15:24 Comment(8)
It prints verbatim Privet, mir! when I run it on godbolt with or without setlocale(LC_ALL, "ru_RU.utf8") or setlocale(LC_ALL, ""). - Aggrieved
But why transliteration? Is it documented somewhere? - Misunderstanding
@Aggrieved Do you have the "ru_RU.utf8" locale installed on your computer? If not, then setting it will fail. Use "" (default locale), which is likely a UTF-8 one. Any Unicode locale will do. - K2
@Misunderstanding I am not sure tbh, but I think it is just illegal to output those characters without a locale, and libc probably can do whatever it wants. Transliteration is a nice way to produce valid and still readable output. - K2
@Aggrieved what locale are you using then? Try "en_US.utf8" if you are in the US. - K2
After generating the ru_RU.UTF-8 locale, the program works for me. Note that the first call to any stdout function has to be done after setlocale. - Malikamalin
@Malikamalin it should work with the default locale too (if it is set to a Unicode locale) - it is important to have this program generate the expected output on any machine with Unicode support and not tie it to a specific language. - K2
One thing to make sure of is that you have a UTF-8 compatible locale installed; check with locale -a in a terminal, then select one from the list that command prints. - Drawer
M
10

why does it transliterate Russian text in Unicode into Latin as opposed to transcoding it into UTF-8 or using replacement characters?

Because the starting locale of your program is the default one, the C locale, so the wide string is converted for the C locale. The C locale doesn't handle UTF-8 or any other Unicode encoding, so your standard library does its best to translate wide characters into the basic character set used in the C locale.

You may change the locale to any UTF-8 locale and the program should output a UTF-8 string.

Note: (in the implementations I know of) the encoding of the FILE stream is determined and saved at the time the stream orientation (wide vs. normal) is chosen. Remember to set the locale before doing anything with stdout (i.e. this vs this).

Malikamalin answered 29/12, 2020 at 15:35 Comment(0)
