Is it possible to convert UTF8 string in a std::string to std::wstring and vice versa in a platform independent manner? In a Windows application I would use MultiByteToWideChar and WideCharToMultiByte. However, the code is compiled for multiple OSes and I'm limited to standard C++ library.
I asked this question 5 years ago. This thread was very helpful for me back then; I came to a conclusion, then moved on with my project. It is funny that I needed something similar recently, totally unrelated to that project from the past. As I was researching possible solutions, I stumbled upon my own question :)
The solution I chose now is based on C++11. The boost libraries that Constantin mentions in his answer are now part of the standard. If we replace std::wstring with the new string type std::u16string, then the conversions will look like this:
UTF-8 to UTF-16
std::string source;
...
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::u16string dest = convert.from_bytes(source);
UTF-16 to UTF-8
std::u16string source;
...
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::string dest = convert.to_bytes(source);
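For completeness, here is a minimal, self-contained sketch of the same round trip (the sample text is arbitrary; it assumes an implementation that still ships <codecvt>, which is deprecated since C++17):
#include <codecvt>
#include <locale>
#include <string>

int main()
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;

    // "zß水🍌" spelled out as raw UTF-8 bytes, to stay independent of u8"" literal rules
    std::string utf8 = "z\xC3\x9F\xE6\xB0\xB4\xF0\x9F\x8D\x8C";

    std::u16string utf16 = convert.from_bytes(utf8);   // UTF-8  -> UTF-16
    std::string back     = convert.to_bytes(utf16);    // UTF-16 -> UTF-8

    return back == utf8 ? 0 : 1;                       // 0 on a successful round trip
}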
As seen from the other answers, there are multiple approaches to the problem. That's why I refrain from picking an accepted answer.
codecvt_utf8_utf16 is deprecated too, by the way (and, no, there is no replacement either). – Scudo
The problem definition explicitly states that the 8-bit character encoding is UTF-8. That makes this a trivial problem; all it requires is a little bit-twiddling to convert from one UTF spec to another.
Just look at the encodings on these Wikipedia pages for UTF-8, UTF-16, and UTF-32.
The principle is simple - go through the input and assemble a 32-bit Unicode code point according to one UTF spec, then emit the code point according to the other spec. The individual code points need no translation, as would be required with any other character encoding; that's what makes this a simple problem.
Here's a quick implementation of wchar_t to UTF-8 conversion and vice versa. It assumes that the input is already properly encoded - the old saying "Garbage in, garbage out" applies here. I believe that verifying the encoding is best done as a separate step.
std::string wchar_to_UTF8(const wchar_t * in)
{
    std::string out;
    unsigned int codepoint = 0;
    for (; *in != 0; ++in)
    {
        if (*in >= 0xd800 && *in <= 0xdbff)
            // high surrogate: keep the top 10 bits of the code point for the next iteration
            codepoint = ((*in - 0xd800) << 10) + 0x10000;
        else
        {
            if (*in >= 0xdc00 && *in <= 0xdfff)
                // low surrogate: merge in the bottom 10 bits
                codepoint |= *in - 0xdc00;
            else
                codepoint = *in;

            // emit the completed code point as 1 to 4 UTF-8 bytes
            if (codepoint <= 0x7f)
                out.append(1, static_cast<char>(codepoint));
            else if (codepoint <= 0x7ff)
            {
                out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            else if (codepoint <= 0xffff)
            {
                out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            else
            {
                out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            codepoint = 0;
        }
    }
    return out;
}
The above code works for both UTF-16 and UTF-32 input, simply because the range D800 through DFFF consists of invalid code points; encountering them means you're decoding UTF-16 surrogate pairs. If you know that wchar_t is 32 bits, you could remove some code to optimize the function.
std::wstring UTF8_to_wchar(const char * in)
{
    std::wstring out;
    unsigned int codepoint = 0;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            // continuation byte: shift in six more bits
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;   // lead byte of a 2-byte sequence
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;   // lead byte of a 3-byte sequence
        else
            codepoint = ch & 0x07;   // lead byte of a 4-byte sequence
        ++in;
        // if the next byte is not a continuation byte, the code point is complete
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (sizeof(wchar_t) > 2)
                // 32-bit wchar_t: store the code point directly (UTF-32)
                out.append(1, static_cast<wchar_t>(codepoint));
            else if (codepoint > 0xffff)
            {
                // 16-bit wchar_t: encode as a UTF-16 surrogate pair
                codepoint -= 0x10000;
                out.append(1, static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
            }
            else if (codepoint < 0xd800 || codepoint >= 0xe000)
                out.append(1, static_cast<wchar_t>(codepoint));
        }
    }
    return out;
}
Again, if you know that wchar_t is 32 bits you could remove some code from this function, but in this case it shouldn't make any difference: the expression sizeof(wchar_t) > 2 is known at compile time, so any decent compiler will recognize the dead code and remove it.
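As a quick sanity check of the two helpers (a sketch, not part of the original answer), round-tripping a string that contains 1-, 2-, 3- and 4-byte UTF-8 characters should reproduce both representations exactly, whether wchar_t is 16 or 32 bits:
#include <cassert>
#include <string>

int main()
{
    // "zß水🍌": ASCII, 2-, 3- and 4-byte UTF-8 sequences (a surrogate pair when wchar_t is 16 bits)
    const wchar_t* wide = L"z\u00DF\u6C34\U0001F34C";

    std::string  utf8 = wchar_to_UTF8(wide);
    std::wstring back = UTF8_to_wchar(utf8.c_str());

    assert(utf8 == "z\xC3\x9F\xE6\xB0\xB4\xF0\x9F\x8D\x8C");
    assert(back == wide);
    return 0;
}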
wchar_t. I've updated the answer. – Rossuck
wchar_t, which is an opaque data type that represents a Unicode character; you are not allowed to make any assumptions about its internal representation. The only thing you can do is call provided library functions on it, like wctomb, which will encode the character using the current system locale encoding. – Scudo
wchar_t, any error checking you can add will be incorrect and will be triggered by valid inputs on some platforms/implementations (potentially). – Scudo
wchar_t is that it holds a range of integers appropriate for the platform on which it is compiled. – Rossuck
wchar_ts, which are abstract code points (according to the C standard), and the concept of encoding (UTF-16 or UTF-32) does not apply to them in any meaningful way. I see what you meant now: basically, this code works both with wchar_t that represents all of Unicode and with platforms like Windows that hack wchar_t for their purposes. – Scudo
wchar_t was intended to hold code points, but that's not how it worked out in practice. As a concrete example, when Windows first got Unicode all the code points fit into 16 bits, so wchar_t was made a 16-bit integer. Later, when Unicode was extended, they were forced to use UTF-16 encoding to make it work, and that's what Windows uses to this day, with wchar_t still 16 bits. Don't look down on Windows; their problems stem from being an early adopter, and they aren't alone. – Rossuck
In UTF8_to_wchar I found that in the else if (codepoint > 0xffff) case I needed to have (0xd7c0 + (codepoint >> 10)) in place of (0xd800 + (codepoint >> 10)). I don't think that has anything to do with Nim; rather, I'm wondering if it's a mistake that should be corrected in your answer. Thanks for your work on this, it's very helpful! – Lambaste
U+1f600 agrees with your observation. There's a mystery here. – Rossuck
0x10000 >> 10 is 0x40, so mathematically your fix is the same as mine. Mine is easier to reconcile with the official algorithm description, though. Amazing how a bug can go uncaught for 14 years. – Rossuck
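For the record, the arithmetic in that last comment can be checked at compile time; for the U+1F600 example, both formulas produce the expected surrogate pair D83D DE00:
// Both ways of computing the high surrogate give the same value for U+1F600,
// and the resulting pair is the expected D83D DE00.
static_assert(0xd800 + ((0x1f600 - 0x10000) >> 10) == 0xd7c0 + (0x1f600 >> 10), "formulas agree");
static_assert(0xd800 + ((0x1f600 - 0x10000) >> 10) == 0xd83d, "high surrogate");
static_assert(0xdc00 + ((0x1f600 - 0x10000) & 0x03ff) == 0xde00, "low surrogate");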
You can extract utf8_codecvt_facet from the Boost Serialization library.
Their usage example:
typedef wchar_t ucs4_t;

std::locale old_locale;
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);

// Set a new global locale
std::locale::global(utf8_locale);

// Send the UCS-4 data out, converting to UTF-8
{
    std::wofstream ofs("data.ucd");
    ofs.imbue(utf8_locale);
    std::copy(ucs4_data.begin(), ucs4_data.end(),
              std::ostream_iterator<ucs4_t, ucs4_t>(ofs));
}

// Read the UTF-8 data back in, converting to UCS-4 on the way in
std::vector<ucs4_t> from_file;
{
    std::wifstream ifs("data.ucd");
    ifs.imbue(utf8_locale);
    ucs4_t item = 0;
    while (ifs >> item) from_file.push_back(item);
}
Look for the utf8_codecvt_facet.hpp and utf8_codecvt_facet.cpp files in the Boost sources.
There are several ways to do this, but the results depend on what character encodings are used in the string and wstring variables.
If you know the string is ASCII, you can simply use wstring's iterator constructor:
string s = "This is surely ASCII.";
wstring w(s.begin(), s.end());
If your string has some other encoding, however, you'll get very bad results. If the encoding is Unicode, you could take a look at the ICU project, which provides a cross-platform set of libraries that convert to and from all sorts of Unicode encodings.
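If you go the ICU route, a minimal sketch might look like this (assuming ICU4C is installed and linked; the function name is illustrative, not part of the original answer):
#include <unicode/unistr.h>
#include <string>

// UTF-8 std::string -> ICU's UTF-16 UnicodeString and back again.
std::string roundtrip_via_icu(const std::string& utf8)
{
    icu::UnicodeString u = icu::UnicodeString::fromUTF8(utf8);
    std::string back;
    u.toUTF8String(back);   // appends the UTF-8 form to 'back'
    return back;
}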
If your string contains characters in a code page, then may $DEITY have mercy on your soul.
You can use the codecvt locale facet. There's a specific specialisation defined, codecvt<wchar_t, char, mbstate_t>, that may be of use to you, although the behaviour of that is system-specific and does not guarantee conversion to UTF-8 in any way.
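For illustration, here is a sketch of driving that facet directly (my example, not part of the answer; whether the narrow side is really UTF-8 depends entirely on the chosen locale):
#include <locale>
#include <string>
#include <vector>

std::wstring narrow_to_wide(const std::string& s)
{
    std::locale loc("");   // the user's preferred locale, often but not always *.UTF-8
    using cvt_t = std::codecvt<wchar_t, char, std::mbstate_t>;
    const cvt_t& cvt = std::use_facet<cvt_t>(loc);

    std::mbstate_t state{};
    std::vector<wchar_t> buf(s.size() + 1);   // one wide char per narrow byte is enough
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    cvt.in(state,
           s.data(), s.data() + s.size(), from_next,
           buf.data(), buf.data() + buf.size(), to_next);

    return std::wstring(buf.data(), to_next);
}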
encoding instead of locale. As far as I can tell, there is no locale that can represent every single Unicode character. Let's say I want to encode a string that contains all of the Unicode characters; which locale do you suggest I configure? Correct me if I am wrong. – Dagnydago
I created my own library for UTF-8 to UTF-16/UTF-32 conversion, but decided to make it a fork of an existing project for that purpose.
https://github.com/tapika/cutf
(Originated from https://github.com/noct/cutf )
The API works with plain C as well as with C++.
Function prototypes look like this (for the full list see https://github.com/tapika/cutf/blob/master/cutf.h):
//
// Converts utf-8 string to wide version.
//
// returns target string length.
//
size_t utf8towchar(const char* s, size_t inSize, wchar_t* out, size_t bufSize);
//
// Converts wide string to utf-8 string.
//
// returns filled buffer length (not string length)
//
size_t wchartoutf8(const wchar_t* s, size_t inSize, char* out, size_t outsize);
#ifdef __cplusplus
std::wstring utf8towide(const char* s);
std::wstring utf8towide(const std::string& s);
std::string widetoutf8(const wchar_t* ws);
std::string widetoutf8(const std::wstring& ws);
#endif
Sample usage / simple test application for utf conversion testing:
#include "cutf.h"
#define ok(statement) \
if( !(statement) ) \
{ \
printf("Failed statement: %s\n", #statement); \
r = 1; \
}
int simpleStringTest()
{
const wchar_t* chineseText = L"主体";
auto s = widetoutf8(chineseText);
size_t r = 0;
printf("simple string test: ");
ok( s.length() == 6 );
uint8_t utf8_array[] = { 0xE4, 0xB8, 0xBB, 0xE4, 0xBD, 0x93 };
for(int i = 0; i < 6; i++)
ok(((uint8_t)s[i]) == utf8_array[i]);
auto ws = utf8towide(s);
ok(ws.length() == 2);
ok(ws == chineseText);
if( r == 0 )
printf("ok.\n");
return (int)r;
}
And if this library does not satisfy your needs, feel free to open the following link, scroll down to the end of the page, and pick any heavier library you like.
If you are using C++17 or later:
The <codecvt> standard library header (C++11) is deprecated in C++17 and removed in C++26.
You can use std::filesystem::path instead:
template<typename T>
constexpr std::basic_string<T> convert_string(const std::filesystem::path& str){
if constexpr(std::is_same_v<T, char>)
{
return str.string();
}
else if (std::is_same_v<T, char8_t>) {
return str.u8string();
}
else if (std::is_same_v<T, char16_t>) {
return str.u16string();
}
else if (std::is_same_v<T, char32_t>) {
return str.u32string();
}
}
But note that this method can throw an exception at runtime if the conversion fails.
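A hypothetical usage sketch (the literal and variable names are mine, and it assumes the convert_string template and includes above): build the path from UTF-8 with std::filesystem::u8path (C++17; in C++20 you would construct it from a char8_t string instead) and pull out the other encodings:
int main()
{
    // "zß水" given as raw UTF-8 bytes
    std::filesystem::path p = std::filesystem::u8path("z\xC3\x9F\xE6\xB0\xB4");

    std::u16string u16 = convert_string<char16_t>(p);
    std::u32string u32 = convert_string<char32_t>(p);

    // std::filesystem::path also exposes wstring() directly if std::wstring is the goal
    std::wstring w = p.wstring();

    return (u16.size() == 3 && u32.size() == 3 && w.size() == 3) ? 0 : 1;
}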
I don't think there's a portable way of doing this. C++ doesn't know the encoding of its multibyte characters.
As Chris suggested, your best bet is to play with codecvt.
std::wstring is std::basic_string<wchar_t>. wchar_t is an opaque data type that represents a Unicode character (the fact that on Windows it is 16 bits long only means that Windows does not follow the standard). There is no “encoding” for abstract Unicode characters; they are just characters. – Scudo