UTF-8 Compatibility in C++

I am writing a program that needs to be able to work with text in all languages. My understanding is that UTF-8 will do the job, but I am experiencing a few problems with it.

Am I right to say that UTF-8 can be stored in a simple char in C++? If so, why do I get the following warning when I use char, string, and stringstream: warning C4566: character represented by universal-character-name '\uFFFD' cannot be represented in the current code page (1252). (I do not get that warning when I use wchar_t, wstring, and wstringstream.)

Additionally, I know that UTF-8 is variable length. When I use the at or substr string methods, will I get the wrong answer?

Wartburg answered 20/8, 2012 at 15:25 Comment(5)
For UTF, wchar_t is the recommended storage. You can store UTF-8 in char without issue, but results will be weird.Hydrocellulose
@Anonymous that depends on your platform (and on which flavor of UTF you're interested in). On Windows, wchar_t is a good fit for UTF-16. On Linux, it is appropriate to use for UTF-32. For UTF-8, char is a pretty reasonable candidate (unless you've got access to the "new" character types in C++11)Muenster
This program will be ported across platforms. Which character type is best for that purpose?Wartburg
@user1563613, if you get a third-party library like ICU to deal with Unicode strings (which you really should), it will define safe data types that will work the same across all supported platforms.Authoritarian
Unless either you only ever store a few hundred characters or south-east Asia is the main market, UTF-8 is the best thing to use. UTF-16 has no (real) advantages and all the disadvantages of UTF-8. UTF-32 on the other hand, has forbidding memory requirements for everybody except the Chinese (apart from being Unicode in the first place, the big disadvantage that all UTFs share). Yes, it's a pain having to convert UTF-8 to UTF-16 before calling Win32 API functions, get over it. It works for everyone, it has no funny character sizes, and it has reasonable memory requirements for everyone.Quinonoid

To use UTF-8 string literals you need to prefix them with u8; otherwise you get the implementation's character set (in your case, it seems to be Windows-1252): u8"\uFFFD" is a null-terminated sequence of bytes with the UTF-8 representation of the replacement character (U+FFFD). It has type char const[4].
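
For illustration, a minimal sketch that prints those bytes (note that since C++20 the literal's element type is char8_t rather than char):

#include <cstdio>

int main() {
    auto const& s = u8"\uFFFD";   // the UTF-8 bytes EF BF BD plus the terminator
    for (unsigned char c : s)
        std::printf("%02X ", c);  // prints: EF BF BD 00
}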

Since UTF-8 is a variable-length encoding, all indexing is done in code units, not code points. Random access to code points in a UTF-8 sequence is not possible because of its variable-length nature. If you want random access you need a fixed-length encoding, like UTF-32. For that you can use the U prefix on strings.
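
A quick sketch of what the fixed-length encoding buys you:

#include <string>

int main() {
    std::u32string s = U"日本語";               // one char32_t per code point
    char32_t second = s.at(1);                  // constant-time access: U+672C
    std::u32string first_two = s.substr(0, 2);  // exactly the first two code points
    return second == U'本' ? 0 : 1;
}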

Adversity answered 20/8, 2012 at 15:29 Comment(9)
I was using the prefix L so far. I tried replacing it with u8 but I get error C2065: 'u8' : undeclared identifier.Wartburg
@user1563613 It is possible that your compiler doesn't support u8 yet. Is it Visual Studio? If so you should probably use UTF-16, which is what the Windows APIs use.Adversity
It is Visual Studio 2010. If I use UTF-16, I have to specify the endianness, correct? If so, wouldn't that be a problem when porting this program to other platforms?Wartburg
@user1563613 the endianness only matters when serializing. In memory you just use 16-bit sized types and the platform uses the appropriate endianness.Adversity
But if the input text files are stored with a specific endianness, and the program accesses those same files from different platforms, wouldn't it fail in some cases?Wartburg
@user1563613 in that case, yes, you need to fix the endianness of the input (but it doesn't matter once it is read).Adversity
UTF-32 is a fixed length encoding for code points, but Unicode is a fundamentally variable length representation of characters in that multiple code points can be used to represent a character. Random access for characters is not possible, whether you use UTF-32 or anything else. Fortunately random access is rarely (if ever) needed.Hokku
But why can wchar_t simply store a UTF-16 encoded character without any prefix like u8?Inflate
Since the C++ standard changed u8 literals to use char8_t they've become unfit for normal usage and I've had to remove them from any place I had tried them.Hokku

Yes, the UTF-8 encoding can be used with char, string, and stringstream. A char will hold a single UTF-8 code unit, of which up to four may be required to represent a single Unicode code point.
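
A quick illustration, with the code units spelled out as hex escapes so it behaves the same under any source encoding:

#include <iostream>
#include <string>

int main() {
    std::string s = "\xF0\x9F\x98\x80";  // U+1F600: one code point, four code units
    std::cout << s.size() << '\n';       // prints 4: string counts code units
}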

However, there are a few issues with using UTF-8 specifically with Microsoft's compilers. C++ implementations use an 'execution character set' for a number of things, such as encoding character and string literals. VC++ always uses the system locale encoding as the execution character set, and Windows does not support UTF-8 as the system locale encoding, so UTF-8 can never be the execution character set.

This means that VC++ never intentionally produces UTF-8 character and string literals. Instead the compiler must be tricked.

Edit: More recent versions of Microsoft's C++ compiler support UTF-8 source and using UTF-8 as the execution encoding. Windows also has at least a beta setting to use UTF-8 as the system locale encoding. See here.

The compiler will convert from the known source code encoding to the execution encoding. That means that if the compiler uses the locale encoding for both the source and execution encodings then no conversion is done. If you can get UTF-8 data into the source code but have the compiler think that the source uses the locale encoding, then character and string literals will use the UTF-8 encoding. VC++ uses the so-called 'BOM' to detect the source encoding, and uses the locale encoding if no BOM is detected. Therefore you can get UTF-8 encoded string literals by saving all your source files as "UTF-8 without signature".

There are caveats with this method. First, you cannot use UCNs with narrow character and string literals: universal character names have to be converted to the execution character set, which isn't UTF-8. You must either write the character literally so it appears as UTF-8 in the source code, or use hex escapes where you manually write out the UTF-8 encoding. Second, in order to produce wide character and string literals the compiler performs a similar conversion from the source encoding to the wide execution character set (which is always UTF-16 in VC++). Since we're lying to the compiler about the encoding, it will perform this conversion to UTF-16 incorrectly. So in wide character and string literals you cannot use non-ASCII characters literally; instead you must use UCNs or hex escapes.
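
A sketch of the two escape styles (U+2764, HEAVY BLACK HEART, encodes as E2 9D A4 in UTF-8):

int main() {
    const char heart_narrow[] = "\xE2\x9D\xA4";  // hand-written UTF-8 bytes
    const wchar_t heart_wide[] = L"\u2764";      // a UCN is safe in wide literals
    (void)heart_narrow;
    (void)heart_wide;
}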


UTF-8 is variable length (as is UTF-16). The indices used with at() and substr() are code units rather than character or code point indices. So if you want a particular code unit then you can just index into the string or array or whatever as normal. If you need a particular code point then you either need a library that can understand composing UTF-8 code units into code points (such as the Boost Unicode iterators library), or you need to convert the UTF-8 data into UTF-32. If you need actual user perceived characters then you need a library that understands how code points are composed into characters. I imagine ICU has such functionality, or you could implement the Default Grapheme Cluster Boundary Specification from the Unicode standard.
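
As an example of working directly on code units, counting code points by hand is easy because UTF-8 continuation bytes are self-identifying; a rough sketch, assuming the input is valid UTF-8:

#include <cstddef>
#include <string>

std::size_t codepoint_count(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)  // not a continuation byte (10xxxxxx), so a code point starts here
            ++n;
    return n;
}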


The above consideration of UTF-8 only really matters for how you write Unicode data in the source code. It has little bearing on the program's input and output.

If your requirements allow you to choose how to do input and output then I would still recommend using UTF-8 for input. Depending on what you need to do with the input you can either convert it to another encoding that's easy for you to process, or you can write your processing routines to work directly on UTF-8.

If you want to ever output anything via the Windows console then you'll want a well defined module for output that can have different implementations, because internationalized output to the Windows console will require a different implementation from either outputting to a file on Windows or console and file output on other platforms. (On other platforms the console is just another file, but the Windows console needs special treatment.)
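
As a rough sketch of the Windows-specific implementation, assuming you keep UTF-8 internally (write_console is a made-up name; MultiByteToWideChar and WriteConsoleW are the relevant Win32 calls):

#include <windows.h>
#include <string>

void write_console(const std::string& utf8) {
    // First call computes the UTF-16 length, second performs the conversion.
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring utf16(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &utf16[0], n);
    DWORD written = 0;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE),
                  utf16.data(), n, &written, nullptr);
}

On other platforms the equivalent function would simply write the UTF-8 bytes to stdout.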

Hokku answered 20/8, 2012 at 16:00 Comment(1)
Note that you can override the source and execution character set with the /utf-8 flag on MSVC: learn.microsoft.com/en-us/cpp/build/reference/…Easter

You can use char as a UTF-8 code unit and in fact this is the default on many platforms, including macOS and various flavors of Linux. Even on Windows/MSVC it is better to use normal char strings than u8/char8_t because the latter may result in silent corruption. Consider the following example (https://godbolt.org/z/PbGcxcfa6):

template <typename T>
void f(T);  // declaration only; enough to see which literal each call receives

int main() {
  f("∞");    // ordinary narrow string literal
  f(u8"∞");  // u8 string literal
}

With the default compiler settings this compiles to:

$SG2781 DB        0e2H, 088H, 09eH, 00H
$SG2782 DB        0c3H, 0a2H, 0cbH, 086H, 0c5H, 0beH, 00H

main    PROC
$LN3:
        sub     rsp, 40                             ; 00000028H
        lea     rcx, OFFSET FLAT:$SG2781
        call    void f<char const *>(char const *)                    ; f<char const *>
        lea     rcx, OFFSET FLAT:$SG2782
        call    void f<char const *>(char const *)                    ; f<char const *>
        xor     eax, eax
        add     rsp, 40                             ; 00000028H
        ret     0
main    ENDP

Notice that the normal char string contains the correct UTF-8 representation of "∞" (0e2H, 088H, 09eH, 00H) while the u8 string contains mojibake: the compiler read the already-UTF-8 source bytes as Windows-1252 characters and re-encoded each of them to UTF-8, producing six bytes instead of three.

at and substr operate at the code unit level, and whether that is correct depends on the use case. In many cases you need to operate at the code point or grapheme cluster level, and both of these may consist of multiple code units/chars. But for simple cases such as searching for a substring, code units can be enough.
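
For instance (with the bytes written as hex escapes so the result doesn't depend on the source encoding):

#include <cassert>
#include <string>

int main() {
    std::string s = "na\xC3\xAFve";   // "naïve"; the ï is two code units
    assert(s.size() == 6);            // six code units, five code points
    assert(s.find("\xC3\xAF") == 2);  // byte-wise substring search still works
}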

The warning is pretty meaningless because the compiler code page has no effect on the representation of normal string literals and has nothing to do with code pages/encodings at runtime. You can suppress it by setting the compiler's code page to UTF-8 (the /utf-8 flag), which is a good idea anyway.

Flagg answered 27/2, 2023 at 18:13 Comment(0)

The reason you get the warning about \uFFFD is that you're trying to fit the code point U+FFFD into a single byte of code page 1252, since, as you noted, UTF-8 works on chars and is variable length.

If you use at or substr, you will possibly get wrong answers, since these methods assume that one byte is one character. This is not the case with UTF-8. Notably, with at you could end up with a single byte of a multi-byte sequence; with substr you could break a sequence and end up with an invalid UTF-8 string (it would start or end with �, \uFFFD, the same character you're apparently trying to use, and the broken character would be lost).
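
For example, using hex escapes for the three bytes of "€" (U+20AC):

#include <string>

int main() {
    std::string s = "\xE2\x82\xAC";       // "€" is three UTF-8 code units
    std::string broken = s.substr(0, 2);  // cuts mid-sequence: invalid UTF-8
    char unit = s.at(1);                  // a lone continuation byte, not a character
    (void)broken;
    (void)unit;
}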

I would recommend that you use wchar_t to store Unicode strings. Since the type is 16 bits or wider on mainstream platforms, many more characters can fit in a single "unit".

Authoritarian answered 20/8, 2012 at 15:28 Comment(4)
The worst part is that it would not end up with a replacement character. Breaking a sequence of UTF-8 bytes in the wrong place with substr simply results in an invalid sequence. To get replacement characters you need to validate and replace them manually.Adversity
@R.MartinhoFernandes, indeed. However, I would believe that by the time the data's presented to the user, some layer of the stack will have done the job. (Still, as you noted, it will remain uncorrected in the C++ program.)Authoritarian
So how would I go about properly getting the substrings or iterating over characters?Wartburg
@user1563613, there is no standard C++ API afaik. You're not the first to ask the question, though; you can see here for some solutions.Authoritarian
