As of C++11, there are additional standard codecvt specialisations and types, intended for converting between various UTF-x and UCSx character sequences; one of these may suit your needs.
In <locale>:
std::codecvt<char16_t, char, std::mbstate_t>: Converts between UTF-16 and UTF-8.
std::codecvt<char32_t, char, std::mbstate_t>: Converts between UTF-32 and UTF-8.
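As a rough sketch of how one of these <locale> specialisations might be used directly: the facet can be fetched from the classic locale with std::use_facet and driven through its in() member. The function name utf8_to_utf32 and the error handling are assumptions made for illustration, and these two specialisations were deprecated in C++20, so a recent compiler may warn.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-32 using the facet from <locale>.
std::u32string utf8_to_utf32(const std::string& utf8) {
    using Facet = std::codecvt<char32_t, char, std::mbstate_t>;
    const Facet& facet = std::use_facet<Facet>(std::locale::classic());

    // A code point never needs more UTF-32 units than it has UTF-8 bytes.
    std::u32string out(utf8.size(), U'\0');
    if (utf8.empty()) return out;

    std::mbstate_t state{};
    const char* from_next = nullptr;
    char32_t* to_next = nullptr;
    const auto result = facet.in(state,
                                 utf8.data(), utf8.data() + utf8.size(), from_next,
                                 &out[0], &out[0] + out.size(), to_next);
    if (result != std::codecvt_base::ok)
        throw std::runtime_error("invalid or incomplete UTF-8 input");

    // Trim the buffer down to the number of code points actually written.
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}

// Usage check: the euro sign U+20AC is E2 82 AC in UTF-8.
void usage_example() {
    const std::u32string s = utf8_to_utf32("\xE2\x82\xAC");
    assert(s.size() == 1 && s[0] == char32_t(0x20AC));
}
```

Going through in() by hand is verbose, but it needs nothing beyond <locale> and makes the output buffer sizing explicit.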
In <codecvt>:
std::codecvt_utf8_utf16<typename Elem>: Converts between UTF-8 and UTF-16, where UTF-16 code units are stored as the specified Elem (note that if char32_t is specified, only one code unit will be stored per char32_t).
- Has two additional, defaulted template parameters (unsigned long MaxCode = 0x10ffff, and std::codecvt_mode Mode = (std::codecvt_mode)0), and inherits from std::codecvt<Elem, char, std::mbstate_t>.
std::codecvt_utf8<typename Elem>: Converts between UTF-8 and either UCS2 or UCS4, depending on Elem (UCS2 for char16_t, UCS4 for char32_t, platform-dependent for wchar_t).
- Has two additional, defaulted template parameters (unsigned long MaxCode = 0x10ffff, and std::codecvt_mode Mode = (std::codecvt_mode)0), and inherits from std::codecvt<Elem, char, std::mbstate_t>.
std::codecvt_utf16<typename Elem>: Converts between UTF-16 (held as a byte sequence in char) and either UCS2 or UCS4, depending on Elem (UCS2 for char16_t, UCS4 for char32_t, platform-dependent for wchar_t).
- Has two additional, defaulted template parameters (unsigned long MaxCode = 0x10ffff, and std::codecvt_mode Mode = (std::codecvt_mode)0), and inherits from std::codecvt<Elem, char, std::mbstate_t>.
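A minimal sketch of putting the <codecvt> facets to work, assuming they are paired with std::wstring_convert (also from <locale>, and not mentioned above). The function names are made up for illustration. Both std::wstring_convert and the <codecvt> facets were deprecated in C++17, so newer compilers may emit warnings.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 -> UTF-16, one UTF-16 code unit per char16_t.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);  // throws std::range_error on bad input
}

// UTF-16 -> UTF-8.
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}

// UTF-8 -> UCS4, one code point per char32_t.
std::u32string utf8_to_ucs4(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}

// Usage check with U+1F600, which is F0 9F 98 80 in UTF-8.
void usage_example() {
    const std::string utf8 = "\xF0\x9F\x98\x80";

    // UTF-16 needs a surrogate pair for code points above U+FFFF.
    const std::u16string utf16 = utf8_to_utf16(utf8);
    assert(utf16.size() == 2 && utf16[0] == 0xD83D && utf16[1] == 0xDE00);

    // UCS4 stores the same code point in a single element.
    const std::u32string ucs4 = utf8_to_ucs4(utf8);
    assert(ucs4.size() == 1 && ucs4[0] == char32_t(0x1F600));

    assert(utf16_to_utf8(utf16) == utf8);
}
```

The same pattern should work for the other facets by swapping the first template argument, as long as the second argument matches the facet's Elem.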
codecvt_utf8 and codecvt_utf16 will convert between the specified UTF and either UCS2 or UCS4, depending on the size of Elem. Therefore, wchar_t will specify UCS2 on systems where it's 16- to 31-bit (such as Windows, where it's 16-bit), or UCS4 on systems where it's at least 32-bit (such as Linux, where it's 32-bit), regardless of whether wchar_t strings actually use that encoding. On platforms that use a different encoding for wchar_t strings (Windows, for instance, uses UTF-16, whose surrogate pairs UCS2 can't represent), this will understandably cause problems if you aren't careful.
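One possible way to sidestep the wchar_t mismatch, sketched under the assumption that a 16-bit wchar_t holds UTF-16 (as on Windows) and a wider one holds UCS4: pick the facet at compile time from sizeof(wchar_t). The names wide_utf8_facet and utf8_to_wide are hypothetical, and std::wstring_convert is an addition not discussed above (deprecated since C++17).

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>
#include <type_traits>

// 16-bit wchar_t: treat wide strings as UTF-16 (surrogate pairs allowed).
// Wider wchar_t:  codecvt_utf8<wchar_t> already yields UCS4.
using wide_utf8_facet = std::conditional<
    sizeof(wchar_t) == 2,
    std::codecvt_utf8_utf16<wchar_t>,
    std::codecvt_utf8<wchar_t>>::type;

// Hypothetical helper: UTF-8 -> the platform's wide encoding.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<wide_utf8_facet, wchar_t> conv;
    return conv.from_bytes(utf8);  // throws std::range_error on bad input
}

void usage_example() {
    // U+00E9 (C3 A9 in UTF-8) fits in one element either way.
    const std::wstring w = utf8_to_wide("\xC3\xA9");
    assert(w.size() == 1 && w[0] == wchar_t(0xE9));

    // U+1F600 lies outside the BMP: two UTF-16 code units, or one UCS4 value.
    const std::wstring emoji = utf8_to_wide("\xF0\x9F\x98\x80");
    assert(emoji.size() == (sizeof(wchar_t) == 2 ? 2u : 1u));
}
```

With plain codecvt_utf8<wchar_t> and a 16-bit wchar_t, the second conversion would be expected to fail, since UCS2 has no way to encode U+1F600.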
For more information, see cppreference.
Note that support for the <codecvt> header was only added to libstdc++ relatively recently (in GCC 5). If you're using an older version of Clang or GCC, you may have to use libc++ instead if you want to use it.
Note that versions of Visual Studio prior to 2015 don't actually support char16_t and char32_t; if these types exist on earlier versions, they will be typedefs for unsigned short and unsigned int, respectively. Also note that older versions of Visual Studio can sometimes have trouble converting strings between UTF encodings, and that Visual Studio 2015 has a glitch that prevents codecvt from working properly with char16_t and char32_t, requiring the use of same-sized integral types instead.