Displaying wide chars with printf

I'm trying to understand how printf works with wide characters (wchar_t).

I've written the following code samples:

Sample 1 :

#include <stdio.h>
#include <stdlib.h>

int     main(void)
{
    wchar_t     *s;

    s = (wchar_t *)malloc(sizeof(wchar_t) * 2);
    s[0] = 42;
    s[1] = 0;
    printf("%ls\n", s);
    free(s);
    return (0);
}

output :

*

Everything is fine here: my character (*) is displayed correctly.

Sample 2 :

I wanted to display another kind of character. On my system, wchar_t seems to be encoded on 4 bytes, so I tried to display the following character: É

#include <stdio.h>
#include <stdlib.h>

int     main(void)
{
    wchar_t     *s;

    s = (wchar_t *)malloc(sizeof(wchar_t) * 2);
    s[0] = 0xC389;
    s[1] = 0;
    printf("%ls\n", s);
    free(s);
    return (0);
}

But there is no output this time. I tried many values from the "encoding" section (cf. previous link) for s[0] (0xC389, 201, 0xC9...), but the É character is never displayed. I also tried %S instead of %ls.

If I call printf like this: printf("<%ls>\n", s), the only character printed is '<'; the output is truncated.

Why do I have this problem? How should I fix it?

Alcott answered 14/11, 2016 at 13:45 Comment(5)
Is there a reason you allocate dynamically instead of declaring an array of two elements? - Pumphrey
Try reading with scanf("%1ls") a "É" and report what value for printf("%lX\n", (unsigned long) s[0]) you get. - Wendolyn
@chux printf("%ld\n", (unsigned long int) L'É'); gives me 201. - Alcott
I suggest reporting the result of reading an "É" with scanf("%1ls"). Your comment reports what the source code thinks an 'É' is. We are interested in how the code handles the I/O, which may differ in character encoding. - Wendolyn
On my system, the return value from scanf("%1ls", s); is -1 (s[0] not set), which supports https://mcmap.net/q/573594/-displaying-wide-chars-with-printf - Wendolyn

Why do I have this problem?

Make sure you check errno and the return value of printf!

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    wchar_t *s;
    s = (wchar_t *) malloc(sizeof(wchar_t) * 2);
    s[0] = 0xC389;
    s[1] = 0;

    if (printf("%ls\n", s) < 0) {
        perror("printf");
    }

    free(s);
    return (0);
}

See the output:

$ gcc test.c && ./a.out
printf: Invalid or incomplete multibyte or wide character

How to fix

First of all, the default locale of a C program is C (also known as POSIX) which is ASCII-only. You will need to add a call to setlocale, specifically setlocale(LC_ALL,"").

If your LC_ALL, LC_CTYPE or LANG environment variables do not select a UTF-8 locale, you'll have to name one explicitly. setlocale(LC_ALL, "C.UTF-8") works on many systems, but the C standard only guarantees the names "C" and "", so check that setlocale does not return NULL.

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    wchar_t *s;
    s = (wchar_t *) malloc(sizeof(wchar_t) * 2);
    s[0] = 0xC389;
    s[1] = 0;

    setlocale(LC_ALL, "");

    if (printf("%ls\n", s) < 0) {
        perror("printf");
    }

    free(s);
    return (0);
}

See the output:

$ gcc test.c && ./a.out
쎉

The wrong character is printed because wchar_t holds a wide character (a code point, e.g. in UTF-32), not a multibyte sequence (such as UTF-8). 0xC389 is the UTF-8 byte sequence for É (0xC3 0x89), but stored in a wchar_t it is read as the code point U+C389, which is the Hangul syllable shown above. Note that wchar_t is always 32 bits wide in the GNU C Library, but the C standard doesn't require it to be. If you initialize the character with the code point of É (U+00C9, i.e. 0xC9), then it prints out correctly:

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    wchar_t *s;
    s = (wchar_t *) malloc(sizeof(wchar_t) * 2);
    s[0] = 0xC9;
    s[1] = 0;

    setlocale(LC_ALL, "");

    if (printf("%ls\n", s) < 0) {
        perror("printf");
    }

    free(s);
    return (0);
}

Output:

$ gcc test.c && ./a.out
É

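Note that you can also set the LC_* (locale) environment variables on the command line. Put the assignment on the same line as the command (or export it) so that the program actually sees it:

$ LC_ALL=C.UTF-8 ./a.out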
É
Saccharin answered 15/11, 2016 at 1:33 Comment(1)
Not sure why it is not working for me on Windows 11 with VS 2019; I get the error "illegal byte sequence". - Ambiversion

One problem is that you are trying to store UTF-8, which is a multibyte encoding made of single-byte (8-bit) code units, in a wide-character type. For UTF-8 you use plain char.

Also note that because you pack the UTF-8 sequence into a wider type, you have endianness (byte-order) issues: in memory 0xC389 might be stored as 0x89 followed by 0xC3, in that order. And if sizeof(wchar_t) == 4, the value is padded with zero bytes, so s[0] in a debugger shows 0x0000C389, which printf treats as a single code point rather than as two UTF-8 bytes.

Another problem is the terminal or console you use to print. Maybe it simply doesn't support UTF-8 or the other encodings you tried?

Pumphrey answered 14/11, 2016 at 13:57 Comment(0)

I found a simple way to print wide chars. The key point is to call setlocale() before printing.

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(int argc, char *argv[])
{
    setlocale(LC_ALL, "");
    // setlocale(LC_ALL, "C.UTF-8"); // this also works

    wchar_t hello_eng[] = L"Hello World!";
    wchar_t hello_china[] = L"世界, 你好!";
    const wchar_t *hello_japan = L"こんにちは日本!"; // points at a literal, so const
    printf("%ls\n", hello_eng);
    printf("%ls\n", hello_china);
    printf("%ls\n", hello_japan);

    return 0;
}
Ainslie answered 31/1, 2021 at 2:2 Comment(0)
