Does wide character input/output in C always read from / write to the correct (system default) encoding?
I'm primarily interested in the Unix-like systems (e.g., portable POSIX) as it seems like Windows does strange things for wide characters.

Do the wide-character input and output functions (like getwchar() and putwchar()) always "do the right thing", for example read and write UTF-8 when that is the locale's encoding, or do I have to convert manually with wcrtomb() and print the resulting multibyte string using, e.g., fputs()? On my system (openSUSE 12.3), where $LANG is set to en_GB.UTF-8, they do seem to do the right thing: inspecting the output I see what looks like valid UTF-8, even though the strings were stored using wchar_t and written using the wide-character functions.

However I am unsure if this is guaranteed. For example cprogramming.com states that:

[wide characters] should not be used for output, since spurious zero bytes and other low-ASCII characters with common meanings (such as '/' and '\n') will likely be sprinkled throughout the data.

This seems to indicate that outputting wide characters (presumably using the wide-character output functions) can wreak havoc.

Since the C standard does not seem to mention encoding at all, I really have no idea who applies an encoding, and when and how, when wchar_t is used. So my question is basically whether reading, writing and using wide characters exclusively is the proper thing to do when my application has no need to know about the encoding used. I only need string lengths and console widths (wcswidth()), so using wchar_t everywhere when dealing with text seems ideal to me.
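
For reference, the manual route I mention above would look roughly like this (just a sketch; put_wc_manually is my own illustrative name, not a standard function):

#include <stdio.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <wchar.h>    /* wcrtomb, mbstate_t */

/* Convert one wide character to the locale's multibyte encoding by
   hand, then write the resulting bytes with fputs(). */
int put_wc_manually(wchar_t wc, FILE *fp)
{
    char buf[MB_LEN_MAX + 1];
    mbstate_t state = {0};
    size_t n = wcrtomb(buf, wc, &state);
    if (n == (size_t)-1)
        return -1;    /* wc is not representable in the current locale */
    buf[n] = '\0';
    return fputs(buf, fp);
}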

Zachery answered 16/3, 2013 at 22:51 Comment(0)

The relevant text governing the behavior of the wide character stdio functions and their relationship to locale is from POSIX XSH 2.5.2 Stream Orientation and Encoding Rules:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_05_02

Basically, the wide character stdio functions always write in the encoding that's in effect (per the LC_CTYPE locale category) at the time the FILE stream becomes wide-oriented; this means the first time a wide stdio function is called on it, or fwide is used to set the orientation to wide. So as long as a proper LC_CTYPE locale is in effect matching the desired "system" encoding (e.g. UTF-8) when you start working with the stream, everything should be fine.
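
For illustration, here is a minimal sketch (my own, not from the POSIX text) that sets the locale first and then fixes the orientation explicitly with fwide, rather than letting the first wide call decide:

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    /* Set LC_CTYPE before the stream becomes wide-oriented; the
       encoding in effect at that moment is the one the stream keeps. */
    setlocale(LC_CTYPE, "");

    /* A positive second argument requests wide orientation; a positive
       return value confirms the stream is now wide-oriented. */
    if (fwide(stdout, 1) > 0)
        wprintf(L"stdout is wide-oriented; output uses the locale's encoding\n");
    return 0;
}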

However, one important consideration you should not overlook is that you must not mix byte- and wide-oriented operations on the same FILE stream. Failure to observe this rule is not a reportable error; it simply results in undefined behavior. As a good deal of library code assumes stderr is byte-oriented (and some even makes the same assumption about stdout), I would strongly discourage ever using wide-oriented functions on the standard streams. If you do, you need to be very careful about which library functions you use.

Really, I can't think of any reason at all to use wide-oriented functions. fprintf is perfectly capable of sending wide-character strings to byte-oriented FILE streams using the %ls specifier.
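
For instance, a minimal sketch of that approach (stdout stays byte-oriented throughout):

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");    /* e.g. picks up en_GB.UTF-8 */
    const wchar_t *msg = L"wide string: £Δᗩ";

    /* printf converts the wide string to the locale's multibyte
       encoding on output; no wide orientation is ever set. */
    printf("%ls\n", msg);
    return 0;
}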

Centillion answered 17/3, 2013 at 3:34 Comment(5)
I assume using putwchar(wc) yields better performance than having to use printf("%lc", wc), but for my current use that performance difference probably isn't important. But just to be clear, setting stdout to wide orientation will only be problematic if library functions actually write to stdout, correct? – Zachery
@Quantumboredom: Yes. stdout begins with no orientation, but once you write using a wide character function, it's set to wide and you mustn't use byte functions on it any more (stderr remains unaffected). I can't think of any standard library functions that will use stdout, but external libraries might. – Comras
@teppic: Ok, in my application anything else writing to stdout would be a bug anyway, and I measured performance dropping to half when using printf("%lc", wc) versus putwchar(wc), so I think I'll stick with wide output on stdout. Thanks by the way for linking to the relevant standard in your answer :-) – Zachery
@Zachery - that was R.. :) I updated my answer with your question for completion. – Comras
@teppic: Ah, I didn't notice. Thanks to you both :-) – Zachery

So long as the locale is set correctly, there shouldn't be any issues processing UTF-8 files on a system using UTF-8 with the wide-character functions. They'll interpret things correctly, i.e. they'll treat a character as 1-4 bytes as necessary (on both input and output). You can test it out with something like this:

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "en_GB.UTF-8");
    // setlocale(LC_CTYPE, ""); // to use the environment's locale instead
    wchar_t *txt = L"£Δᗩ";

    // wcslen() returns size_t, so print it with %zu, not %d
    wprintf(L"The string %ls has %zu characters\n", txt, wcslen(txt));
}


$ gcc -o loc loc.c && ./loc
The string £Δᗩ has 3 characters

If you use the standard functions (in particular character functions) on multibyte strings carelessly, things will start to break, e.g. the equivalent:

char *txt = "£Δᗩ";
printf("The string %s has %zu characters\n", txt, strlen(txt));

$ gcc -o nloc nloc.c && ./nloc
The string £Δᗩ has 7 characters

The string still prints correctly here because it's essentially just a stream of bytes, and as the system is expecting UTF-8 sequences, they're translated perfectly. Of course strlen is reporting the number of bytes in the string, 7 (plus the \0), with no understanding that a character and a byte aren't equivalent.

In this respect, because of the compatibility between ASCII and UTF-8, you can often get away with treating UTF-8 files as simply multibyte C strings, as long as you're careful.
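
As an example of the "careful" part, here's a sketch (mb_count is just an illustrative helper, and it assumes the LC_CTYPE locale has already been set) that counts characters rather than bytes by walking the multibyte sequences with mbrlen:

#include <string.h>
#include <wchar.h>

/* Count characters (not bytes) in a multibyte string. */
size_t mb_count(const char *s)
{
    mbstate_t state = {0};
    size_t chars = 0;
    size_t left = strlen(s);
    while (left > 0) {
        size_t n = mbrlen(s, left, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return chars;   /* invalid or incomplete sequence: stop */
        s += n;
        left -= n;
        chars++;
    }
    return chars;
}

For the earlier string, mb_count("£Δᗩ") would report 3 where strlen reports 7.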

There's a degree of flexibility as well. It's possible to convert a standard C string (as a multibyte string) to a wide character string easily:

char *stdtxt = "ASCII and UTF-8 €£¢";
wchar_t buf[100];
// Limit the conversion to the buffer's capacity (in wide characters),
// rather than hard-coding a count that just happens to fit
mbstowcs(buf, stdtxt, sizeof buf / sizeof buf[0]);

wprintf(L"%ls has %zu wide characters\n", buf, wcslen(buf));

Output:
ASCII and UTF-8 €£¢ has 19 wide characters

Once you've used a wide-character function on a stream, it's set to wide orientation. If you later want to use standard byte I/O functions on it, you'll need to re-open the stream first, since a successful freopen is the only way to reset a stream's orientation. This is probably why the recommendation is not to use wide output on stdout. However, if you only use wide-character functions on stdin and stdout (including in any code that you link to), you will not have any problems.
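
For completeness, resetting the orientation might look like this (a sketch; /dev/tty is just one POSIX-specific way to re-open the terminal):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wprintf(L"wide output\n");    /* stdout becomes wide-oriented here */

    /* Re-opening the stream resets its orientation, so byte
       functions may be used on it again. */
    if (freopen("/dev/tty", "w", stdout) == NULL)
        return 1;
    puts("byte output again");
    return 0;
}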

Comras answered 16/3, 2013 at 23:21 Comment(5)
'Break' is not quite right. The description should be 'The string occupies 7 bytes', which is accurate. That it contains only 3 characters is also correct. This is a difference in part between multi-byte strings (mbs* functions) and wide-character strings (wcs* functions). However, that's nitpicking; your core answer is fine. – Marginate
@JonathanLeffler - I was just editing to address what I said as you wrote that. – Comras
@JonathanLeffler - heh, that's ok. I've filled it out a bit. – Comras
@teppic: Thanks for the examples, your answer was also very good. – Zachery
Use strnlen, not strlen. – Burgrave

Don't use fputs with anything other than ASCII.

If you want to write, let's say, UTF-8, then use a function that returns the real size used by the UTF-8 string and use fwrite to write that number of bytes, without worrying about stray '\0' bytes inside the string.
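
In code, the approach described would look something like this (a sketch; note that fwrite itself makes no assumptions about the bytes it is given):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *utf8 = "£Δᗩ\n";
    size_t nbytes = strlen(utf8);   /* byte length of the UTF-8 data */

    /* Write exactly nbytes bytes, regardless of their values. */
    fwrite(utf8, 1, nbytes, stdout);
    return 0;
}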

Lunneta answered 16/3, 2013 at 23:5 Comment(7)
Welcome to Stack Overflow. fputs() outputs a byte string up to the first zero byte. UTF-8 contains only one character value with a zero byte, and that's U+0000 (encoded as '\0' in UTF-8). So fputs() won't mishandle a null-terminated UTF-8 string. Indeed, one of the merits of UTF-8 is that a naïve program that is unaware of UTF-8 can often handle the strings correctly even so. (Not always — there are plenty of ways to cause trouble; but often...) Also, fputs() is fine for single-byte codesets such as ISO 8859-1 or 8859-15 (8859-2, ...). Limiting it to ASCII is unjustifiably stringent. – Marginate
Hi, he is not compiling as full UTF-8. He's using a UTF-8 string in an ASCII-compiled source. – Lunneta
And fputs will fail because UTF-8 strings are not encoded one byte per character. – Lunneta
Better than that, he should use wchar_t and fputws(const wchar_t *restrict, FILE *restrict); – Lunneta
fputs() will not fail just because UTF-8 is a multi-byte code set. Indeed, one of the goals of the design of UTF-8 was to let naïve programs that are unaware of UTF-8 still process it successfully. Your assertion that fputs() is only good for ASCII is blatantly wrong, even taking a charitable interpretation that you mean 'a single-byte code set based on ASCII, such as 8859-1'. Note that UTF-8 is a multi-byte code set (or character encoding), not one that uses wide characters; you would not handle UTF-8 with wide-character functions. UTF-16 and UTF-32 are wide-character representations of Unicode. – Marginate
Now, if you are trying to argue that you cannot use fputs() to output a wide-character string, then I'd agree with you, but that isn't what your answer says at all. Your answer largely avoids answering the actual question, in fact. – Marginate
My answer says what it says: writing will fail, which means the behavior will not be what was intended. Why: because of the encoding. Plus, I'm not the one arguing here; you are. \0 is zero. Lots of multibyte-encoded characters will have a single 0 byte, which means no more writing with fputs. – Lunneta
