Multi-Byte UTF-8 in Arrays in C++

I have been having trouble working with 3-byte UTF-8 characters in arrays. When they are in char arrays I get "multi-character character constant" and "implicit constant conversion" warnings, but when I use wchar_t arrays, wcout prints nothing at all. Because of the nature of the project, it must be an array and not a string. Below is an example of what I've been trying to do.

#include <iostream>
#include <string>
using namespace std;
int main()
{
    wchar_t testing[40];
    testing[0] = L'\u0B95';
    testing[1] = L'\u0BA3';
    testing[2] = L'\u0B82';
    testing[3] = L'\0';
    wcout << testing[0] << endl;
    return 0;
}

Any suggestions? I'm working with OSX.

Wad answered 24/11, 2012 at 23:17 Comment(3)
When you store them in char arrays, such a code point takes three chars. Multi-character character constants are an entirely different thing. - Halcomb
A wstring is not UTF-8 (and not necessarily UTF-16 or UCS-4 either). You don't know which encoding it uses, so writing fixed values inside it is asking for trouble. - Ruthie
They don't have any encoding. They are just some bytes. - Particularism

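Since '\u0B95' requires 3 bytes in UTF-8, it does not fit in a single char, and gcc treats it as a multicharacter literal. A multicharacter literal has type int and an implementation-defined value, which is why assigning it to a char also triggers the conversion warning. (Actually, I don't think gcc is correct to do this.)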

Putting the L prefix before the literal gives it type wchar_t and an implementation-defined value (it maps to a value in the execution wide-character set, which is an implementation-defined superset of the basic execution wide-character set).

The C++11 standard provides some more Unicode-aware types and literals. The additional types are char16_t and char32_t, which hold UTF-16 and UTF-32 code units respectively. For a character in the basic multilingual plane, a single char16_t holds the Unicode code point itself; a char32_t can hold any code point.
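To make the difference between the two types concrete, here is a small sketch that checks a few literals at compile time. It assumes a C++11 or later compiler; the names emoji16, emoji32 and code_units are made up for illustration.

```cpp
#include <cstddef>

// A BMP character fits in one char16_t, whose value is the code point.
static_assert(u'\u0B95' == 0x0B95, "one UTF-16 code unit for U+0B95");

// A char32_t holds any code point directly, including ones outside the BMP.
static_assert(U'\U0001F600' == 0x1F600, "one UTF-32 code unit for U+1F600");

// The same non-BMP character stored as UTF-16 and as UTF-32.
const char16_t emoji16[] = u"\U0001F600";
const char32_t emoji32[] = U"\U0001F600";

// Hypothetical helper: number of code units in a literal-initialized array,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N])
{
    return N - 1;
}

// Outside the BMP, UTF-16 needs a surrogate pair; UTF-32 still needs one unit.
static_assert(sizeof(emoji16) / sizeof(char16_t) == 3, "two units plus null");
static_assert(sizeof(emoji32) / sizeof(char32_t) == 2, "one unit plus null");
```

The three Tamil characters in the question are all in the BMP, so each one occupies exactly one char16_t element.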

Since you need character literals to store characters from the basic multilingual plane, you'll need a char16_t literal. This can be written as, for example, u'\u0B95'. You can therefore write your code as follows, with no warnings or errors:

char16_t testing[40];
testing[0] = u'\u0B95';
testing[1] = u'\u0BA3';
testing[2] = u'\u0B82';
testing[3] = u'\0';

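Unfortunately, the I/O library does not play nicely with these new types: there are no standard stream objects for char16_t or char32_t (nothing comparable to cout or wcout), so the array cannot be printed as text directly.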
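One possible workaround is to convert the char16_t array to UTF-8 by hand and write the resulting bytes to std::cout. The following is only a sketch: to_utf8 is a hypothetical helper, not a standard function, it does not validate unpaired surrogates, and it assumes the terminal interprets its input as UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: converts a null-terminated UTF-16 array to UTF-8.
// Surrogate pairs are combined; unpaired surrogates are not validated.
std::string to_utf8(const char16_t* s)
{
    assert(s != nullptr);
    std::string out;
    for (std::size_t i = 0; s[i] != u'\0'; ++i) {
        char32_t cp = s[i];
        // A high surrogate followed by a low surrogate forms one code point.
        // Reading s[i + 1] is safe: at worst it is the terminating null.
        if (cp >= 0xD800 && cp <= 0xDBFF &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 4 bytes: 11110xxx followed by three 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With the array above, the output line would be std::cout << to_utf8(testing) << '\n'; after including <iostream>. The std::string exists only for output, so the data itself can stay in the char16_t array.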

If you do not truly require using character literals as above, you may make use of the new UTF-8 string literals:

const char* testing = u8"\u0B95\u0BA3\u0B82";

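This will encode the characters as UTF-8, three bytes per character here. Note that from C++20 onward a u8 literal has type const char8_t[], so the line above compiles as written only in C++11 through C++17.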
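Since the question asks for an array rather than a string, the same UTF-8 bytes can also be kept in a plain char array. In this sketch the byte values are written out by hand so that it does not depend on the type of u8 literals; count_code_points and print_testing are hypothetical helpers, and the output only looks right on a terminal that expects UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <iostream>

// U+0B95, U+0BA3 and U+0B82 encoded as UTF-8: three bytes each.
// The initializer sizes the array: 9 bytes plus the terminating '\0'.
const char testing[] = "\xE0\xAE\x95" "\xE0\xAE\xA3" "\xE0\xAE\x82";

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
std::size_t count_code_points(const char* s)
{
    assert(s != nullptr);
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Writes the raw bytes; the terminal, not the stream, decodes them.
void print_testing()
{
    std::cout << testing << '\n';
}
```

Here strlen reports bytes, not characters, so any code that indexes into the array has to step over whole three-byte sequences rather than single elements.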

Aerogram answered 24/11, 2012 at 23:41
