Conversion from wstring to u16string and back (standard conform) in C++17 / C++20

My main platform is Windows, which is why I use UTF-16 internally (mostly BMP strings). I would like to write these strings to the console.

Unfortunately, there is no std::u16cout or std::u8cout, so I need to use std::wcout. Therefore I must convert my u16strings to wstrings. What is the best (and easiest) way to do that?

On Windows I know that a wstring holds UTF-16 data, so I can create a simple std::u16string_view over the same data (no conversion). But on Linux a wstring is usually UTF-32... Is there a way to do this without macros and without assumptions like sizeof(wchar_t) == 2 => UTF-16?

Minnich answered 20/4, 2020 at 13:19 Comment(3)
If you're on not-Windows, shouldn't you be using std::cout and std::string, not std::wcout and std::wstring? That is, shouldn't the conversion be to UTF-8, which is ubiquitous on not-Windows platforms? - Evanesce
He's probably manipulating data generated by Windows applications and generated for Windows applications on a Linux server or something like that. Edit: ah, he's working with BMP strings. There's your reason. - Failure
Does this answer your question? "how can I convert wstring to u16string?" - Ammerman

There is nothing in the C++20 standard that converts wchar_t to char32_t and back. After all, wchar_t is supposed to be large enough to contain any supported code point.

And indeed, everywhere Unicode above U+FFFF is supported, wchar_t is 32-bit, except on Windows (and in Java, but that's irrelevant). So yes, even today working with Unicode in a portable way is problematic, and checking sizeof(wchar_t) == 2 or using #ifdef _WIN32 both sound like legitimate workarounds.
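A minimal sketch of the sizeof(wchar_t) workaround in C++17 could look like the following. The helper names to_wstring and to_u16string are made up for illustration, and the code assumes that a 2-byte wchar_t holds UTF-16 code units and that a wider wchar_t holds UTF-32 code points; the standard guarantees neither. Unpaired surrogates and out-of-range values are passed through without validation.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: UTF-16 -> wstring.
// Assumption: wchar_t is UTF-16 if it is as wide as char16_t, else UTF-32.
std::wstring to_wstring(std::u16string_view in) {
    if constexpr (sizeof(wchar_t) == sizeof(char16_t)) {
        // Same code units on both sides: copy them one by one.
        return std::wstring(in.begin(), in.end());
    } else {
        std::wstring out;
        out.reserve(in.size());
        for (std::size_t i = 0; i < in.size(); ++i) {
            char32_t cp = in[i];
            // Combine a high surrogate followed by a low surrogate.
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
                in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            }
            out.push_back(static_cast<wchar_t>(cp));
        }
        return out;
    }
}

// Hypothetical helper: wstring -> UTF-16, under the same assumption.
std::u16string to_u16string(std::wstring_view in) {
    if constexpr (sizeof(wchar_t) == sizeof(char16_t)) {
        return std::u16string(in.begin(), in.end());
    } else {
        std::u16string out;
        out.reserve(in.size());
        for (wchar_t wc : in) {
            char32_t cp = static_cast<char32_t>(wc);
            if (cp >= 0x10000) {
                // Split a supplementary code point into a surrogate pair.
                cp -= 0x10000;
                out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
                out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
            } else {
                out.push_back(static_cast<char16_t>(cp));
            }
        }
        return out;
    }
}
```

With helpers like these, something like std::wcout << to_wstring(my_u16string) would compile on both kinds of platform. Whether the console then renders the text correctly still depends on the locale and terminal setup.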

Having said that, wcout still seamlessly works with wchar_t on all platforms regardless of the underlying encoding.

Only if you cut wstrings or work with individual code points, and you want to support code points beyond the Basic Multilingual Plane, do you need to take surrogate pairs into account. That is still pretty easy: 0xD800–0xDBFF is the first (high) surrogate of a pair, 0xDC00–0xDFFF is the second (low) surrogate, and you must not cut in between.
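As a rough illustration of the "don't cut in between" rule, here is a small C++17 sketch that truncates a UTF-16 string without splitting a surrogate pair. The function name safe_prefix and the two predicates are hypothetical, not standard library facilities, and the sketch only looks at code units (it knows nothing about combining characters or graphemes).

```cpp
#include <cstddef>
#include <string_view>

// 0xD800-0xDBFF: first (high) half of a surrogate pair.
constexpr bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
// 0xDC00-0xDFFF: second (low) half of a surrogate pair.
constexpr bool is_low_surrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Hypothetical helper: returns a prefix of at most max_units code units
// that never ends between the two halves of a surrogate pair.
std::u16string_view safe_prefix(std::u16string_view s, std::size_t max_units) {
    if (max_units >= s.size()) return s;
    // The cut would fall between s[max_units - 1] and s[max_units].
    if (max_units > 0 && is_high_surrogate(s[max_units - 1]) &&
        is_low_surrogate(s[max_units])) {
        --max_units;  // back up so the whole pair is dropped
    }
    return s.substr(0, max_units);
}
```

For the string u"a\U0001F600b" (code units: a, high surrogate, low surrogate, b), asking for two code units yields only u"a", because cutting after the second unit would separate the pair.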

Fey answered 20/4, 2020 at 15:21 Comment(3)
I think it's also important to note that char32_t only represents a code point and not a grapheme. If you need to work with actual rendered graphemes, that requires a specialized library. This is complicated... just a wee bit. - Amargo
Yes, Unicode also has c̮oͣm̥bͮi̪n̆ìnͨǵ čh̎a͏r̷a͍c̘t́èr̗sͥ... - Fey
I did it now with your workaround... Not nice but it works :-) - Minnich