How to convert UTF-8 std::string to UTF-16 std::wstring?

If I have a UTF-8 std::string how do I convert it to a UTF-16 std::wstring? Actually, I want to compare two Persian words.

Children answered 22/8, 2011 at 21:40 Comment(2)
See #148903 among others.Evania
possible duplicate of how can I compare utf8 string such as persian words in c++? or this.Leavings
A
31

Here's some code. It's only lightly tested and there are probably a few improvements to be made. Call this function to convert a UTF-8 string to a UTF-16 wstring. If it thinks the input string is not UTF-8, it throws an exception; otherwise it returns the equivalent UTF-16 wstring.

#include <stdexcept>
#include <string>
#include <vector>

// Convert a UTF-8 encoded std::string to a UTF-16 encoded std::wstring.
// Throws std::logic_error if the input is not valid UTF-8.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    std::vector<unsigned long> unicode;
    size_t i = 0;
    // First pass: decode the UTF-8 byte sequence into Unicode code points.
    while (i < utf8.size())
    {
        unsigned long uni;
        size_t todo;
        unsigned char ch = utf8[i++];
        if (ch <= 0x7F)
        {
            // Single-byte (ASCII) sequence.
            uni = ch;
            todo = 0;
        }
        else if (ch <= 0xBF)
        {
            // A continuation byte cannot start a sequence.
            throw std::logic_error("not a UTF-8 string");
        }
        else if (ch <= 0xDF)
        {
            // Two-byte sequence.
            uni = ch & 0x1F;
            todo = 1;
        }
        else if (ch <= 0xEF)
        {
            // Three-byte sequence.
            uni = ch & 0x0F;
            todo = 2;
        }
        else if (ch <= 0xF7)
        {
            // Four-byte sequence.
            uni = ch & 0x07;
            todo = 3;
        }
        else
        {
            throw std::logic_error("not a UTF-8 string");
        }
        // Consume the continuation bytes, accumulating 6 bits from each.
        for (size_t j = 0; j < todo; ++j)
        {
            if (i == utf8.size())
                throw std::logic_error("not a UTF-8 string");
            unsigned char ch = utf8[i++];
            if (ch < 0x80 || ch > 0xBF)
                throw std::logic_error("not a UTF-8 string");
            uni <<= 6;
            uni += ch & 0x3F;
        }
        // Surrogate code points and values above U+10FFFF are not valid.
        if (uni >= 0xD800 && uni <= 0xDFFF)
            throw std::logic_error("not a UTF-8 string");
        if (uni > 0x10FFFF)
            throw std::logic_error("not a UTF-8 string");
        unicode.push_back(uni);
    }
    // Second pass: encode the code points as UTF-16, using surrogate pairs
    // for code points above U+FFFF.
    std::wstring utf16;
    for (size_t i = 0; i < unicode.size(); ++i)
    {
        unsigned long uni = unicode[i];
        if (uni <= 0xFFFF)
        {
            utf16 += (wchar_t)uni;
        }
        else
        {
            uni -= 0x10000;
            utf16 += (wchar_t)((uni >> 10) + 0xD800);
            utf16 += (wchar_t)((uni & 0x3FF) + 0xDC00);
        }
    }
    return utf16;
}
Aleenaleetha answered 22/8, 2011 at 22:13 Comment(7)
Thank you! Thank you! It worked... I can't believe it :) Thank you for your time, John.Children
Really glad it helped. It really is just a matter of asking the right question. There's a lot of knowledge on this forum, but newbies often can't access that knowledge because they don't know what to ask.Aleenaleetha
@aliakbarian: I've actually just spotted a minor bug in my code; you should probably copy it again. I changed if (j == utf8.size()) to if (i == utf8.size()).Aleenaleetha
Note: this is Windows-only. Unix systems use 32 bits for wchar_t, although you can still do std::wstring wstr(str.begin(), str.end()); on Windows.Aylesbury
@coo Sure, that's possible. If your goal is to trash your data. Simply widening every UTF-8 code unit to fit into a UTF-16 code unit does not magically convert between those encodings. This will just produce gibberish for any code unit in the input sequence that doesn't happen to encode an ASCII code point.Deterge
Great job, thanks. I've used it to convert an FPC string (via @str[1]) to a C++ wstring.Interlunar
This code allows, for example, invalid overlong encodings through, and having an intermediate copy means gratuitously copying each code point. I think it's better to use a carefully developed and tested algorithm. I would recommend the implementation in the Boost library (github.com/boostorg/nowide) - it has a freestanding version so you can use it independently of the rest of Boost.Protagonist
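
For reference, here is a minimal sketch of the Boost.Nowide approach suggested in the comment above. It assumes the widen/narrow helpers declared in <boost/nowide/convert.hpp> and is illustrative rather than tested:

#include <string>
#include <boost/nowide/convert.hpp>

int main()
{
    std::string utf8 = u8"سلام";                        // a UTF-8 encoded Persian word
    std::wstring wide = boost::nowide::widen(utf8);     // UTF-16 on Windows, UTF-32 on most Unix systems
    std::string back = boost::nowide::narrow(wide);     // round-trip back to UTF-8
    return 0;
}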
P
54

This is how you do it with C++11:

std::string str = "your string in utf8";
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
std::u16string u16str = converter.from_bytes(str);

And these are the headers you need:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

A more complete example is available here: http://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
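
Putting the snippet and the headers together, a minimal compilable sketch could look like the following (untested here; note that std::wstring_convert and <codecvt> were deprecated in C++17, as mentioned in the comments below):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::string str = u8"your string in utf8";
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    std::u16string u16str = converter.from_bytes(str);
    std::cout << "converted to " << u16str.size() << " UTF-16 code units\n";
    return 0;
}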

Pirtle answered 14/7, 2016 at 20:7 Comment(4)
Great answer, thanks! ...but do follow the example at cppreference.com. wchar_t is not a 16-bit type on operating systems other than Windows. You need to use char16_t instead.Drais
@CrisLuengo thanks! 👍 I updated the answer to use char16_t instead.Pirtle
Not working with g++ 6.2 or clang++ 3.8 on lubuntu 16.04Stoicism
Unfortunately, this was deprecated in C++17. mariusbancila.ro/blog/2018/07/05/…Lotti
H
2

There are some relevant Q&A here and here which are worth a read.

Basically you need to convert the strings to a common format -- my preference is always to convert to UTF-8, but your mileage may vary.

Lots of software has been written for doing the conversion -- the conversion is straightforward and can be written in a few hours -- however, why not pick up something already done, such as UTF-8 CPP?
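
For illustration, a minimal sketch using UTF-8 CPP (utfcpp) might look like this -- it assumes the iterator-based utf8::utf8to16 overload from the library's utf8.h header and is untested here:

#include <iterator>
#include <string>
#include "utf8.h" // UTF-8 CPP (utfcpp)

std::u16string to_utf16(const std::string& utf8)
{
    std::u16string utf16;
    // utf8to16 decodes the UTF-8 range and appends UTF-16 code units to the output iterator.
    utf8::utf8to16(utf8.begin(), utf8.end(), std::back_inserter(utf16));
    return utf16;
}

On Windows, where wchar_t is 16 bits, the same call should also work when appending into a std::wstring instead.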

Herv answered 22/8, 2011 at 21:56 Comment(1)
If you're Windows only: msdn.microsoft.com/en-us/library/dd319072(v=VS.85).aspx. Otherwise, use a portable library.Guttle
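
For the Windows-only route mentioned in the comment above, a rough, untested sketch using the Win32 MultiByteToWideChar API could look like:

#include <string>
#include <windows.h>

std::wstring utf8_to_wstring(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();
    // First call: ask how many wchar_t units the converted string needs.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring utf16(len, L'\0');
    // Second call: perform the actual conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &utf16[0], len);
    return utf16;
}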
A
2

To convert between the two types, you should use std::codecvt_utf8_utf16<wchar_t>.
Note the string prefixes I use to define UTF-16 (L) and UTF-8 (u8).

#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

int main()
{
    std::string original8 = u8"הלו";
    std::wstring original16 = L"הלו";

    // C++11 format converter
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;

    // convert from UTF-16 to UTF-8 and std::string
    std::string utf8NativeString = convert.to_bytes(original16);

    // convert from UTF-8 to UTF-16 and std::wstring
    std::wstring utf16NativeString = convert.from_bytes(original8);

    assert(utf8NativeString == original8);
    assert(utf16NativeString == original16);

    return 0;
}
Acus answered 2/1, 2020 at 9:47 Comment(0)
H
0

Microsoft has developed a beautiful library for such conversions as part of their Casablanca project, also known as the C++ REST SDK (cpprestsdk). The conversion functions live under the utility::conversions namespace.

A simple usage, after using namespace utility::conversions, would look something like this:

utf8_to_utf16("sample_string");
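
As a rough, untested sketch (assuming the conversion helpers are declared in <cpprest/asyncrt_utils.h>, and leaving the result type as auto since the exact typedef may vary between SDK versions):

#include <string>
#include <cpprest/asyncrt_utils.h> // C++ REST SDK ("Casablanca")

int main()
{
    std::string utf8 = "sample_string";
    // Converts the UTF-8 encoded bytes to a UTF-16 encoded string.
    auto utf16 = utility::conversions::utf8_to_utf16(utf8);
    return 0;
}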
Hearty answered 8/7, 2020 at 23:6 Comment(0)
I
0

With C++/WinRT you can easily convert a UTF-8 std::string to an hstring (wide characters) using winrt::to_hstring().
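
A minimal, untested sketch (assuming the std::string_view overload of winrt::to_hstring, which interprets the input as UTF-8, and the <winrt/base.h> header):

#include <string>
#include <winrt/base.h> // C++/WinRT

int main()
{
    std::string utf8 = "sample_string";
    // to_hstring treats the input as UTF-8 and returns a UTF-16 hstring.
    winrt::hstring wide = winrt::to_hstring(utf8);
    // Copy into a std::wstring if one is required.
    std::wstring ws(wide.c_str(), wide.size());
    return 0;
}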

Ichang answered 19/5, 2024 at 17:54 Comment(0)
S
-1

This page also seems useful: http://www.codeproject.com/KB/string/UtfConverter.aspx

In the comment section of that page, there are also some interesting suggestions for this task like:

// Get an ASCII std::string from anywhere
std::string sLogLevelA = "Hello ASCII-world!";

std::wstringstream ws;
ws << sLogLevelA.c_str();
std::wstring sLogLevel = ws.str();

Or

// To std::string:
str.assign(ws.begin(), ws.end());
// To std::wstring
ws.assign(str.begin(), str.end());

Though I'm not sure about the validity of these approaches...

Seminary answered 23/8, 2011 at 8:14 Comment(1)
assign() is definitely not the way to convert UTF-8 <-> UTF-16. Don't try this at home.Epigastrium
