How to convert UTF-8 std::string to UTF-16 std::wstring?

If I have a UTF-8 std::string how do I convert it to a UTF-16 std::wstring? Actually, I want to compare two Persian words.

Children answered 22/8, 2011 at 21:40 Comment(2)
See #148903 among others.Evania
possible duplicate of how can I compare utf8 string such as persian words in c++? or this.Leavings
A
31

Here's some code. It's only lightly tested and there are probably a few improvements to be made. Call this function to convert a UTF-8 string to a UTF-16 wstring. If it thinks the input string is not UTF-8, it throws an exception; otherwise it returns the equivalent UTF-16 wstring.

#include <stdexcept>
#include <string>
#include <vector>

// Convert a UTF-8 encoded std::string to a UTF-16 encoded std::wstring.
// Throws std::logic_error if the input is not valid UTF-8.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    std::vector<unsigned long> unicode;
    size_t i = 0;
    // First pass: decode the UTF-8 byte sequence into Unicode code points.
    while (i < utf8.size())
    {
        unsigned long uni;
        size_t todo;
        unsigned char ch = utf8[i++];
        if (ch <= 0x7F)
        {
            // Single-byte (ASCII) sequence.
            uni = ch;
            todo = 0;
        }
        else if (ch <= 0xBF)
        {
            // A continuation byte cannot start a sequence.
            throw std::logic_error("not a UTF-8 string");
        }
        else if (ch <= 0xDF)
        {
            // Two-byte sequence.
            uni = ch & 0x1F;
            todo = 1;
        }
        else if (ch <= 0xEF)
        {
            // Three-byte sequence.
            uni = ch & 0x0F;
            todo = 2;
        }
        else if (ch <= 0xF7)
        {
            // Four-byte sequence.
            uni = ch & 0x07;
            todo = 3;
        }
        else
        {
            throw std::logic_error("not a UTF-8 string");
        }
        // Consume the continuation bytes, accumulating 6 bits from each.
        for (size_t j = 0; j < todo; ++j)
        {
            if (i == utf8.size())
                throw std::logic_error("not a UTF-8 string");
            unsigned char ch = utf8[i++];
            if (ch < 0x80 || ch > 0xBF)
                throw std::logic_error("not a UTF-8 string");
            uni <<= 6;
            uni += ch & 0x3F;
        }
        // Surrogate code points and values above U+10FFFF are not valid.
        if (uni >= 0xD800 && uni <= 0xDFFF)
            throw std::logic_error("not a UTF-8 string");
        if (uni > 0x10FFFF)
            throw std::logic_error("not a UTF-8 string");
        unicode.push_back(uni);
    }
    // Second pass: encode the code points as UTF-16, using surrogate pairs
    // for code points above U+FFFF.
    std::wstring utf16;
    for (size_t i = 0; i < unicode.size(); ++i)
    {
        unsigned long uni = unicode[i];
        if (uni <= 0xFFFF)
        {
            utf16 += (wchar_t)uni;
        }
        else
        {
            uni -= 0x10000;
            utf16 += (wchar_t)((uni >> 10) + 0xD800);
            utf16 += (wchar_t)((uni & 0x3FF) + 0xDC00);
        }
    }
    return utf16;
}
Aleenaleetha answered 22/8, 2011 at 22:13 Comment(7)
Thank you! Thank you! It worked... I can't believe it :) Thank you for your time, John.Children
Really glad it helped. It really is just a matter of asking the right question. There's a lot of knowledge on this forum, but newbies often can't access that knowledge because they don't know what to ask.Aleenaleetha
@aliakbarian: I've actually just spotted a minor bug in my code; you should probably copy it again. I changed if (j == utf8.size()) to if (i == utf8.size()).Aleenaleetha
Note: this is Windows-only. Unix systems use 32 bits for wchar_t, although you can still do std::wstring wstr(str.begin(), str.end()); on Windows.Aylesbury
@coo Sure, that's possible. If your goal is to trash your data. Simply widening every UTF-8 code unit to fit into a UTF-16 code unit does not magically convert between those encodings. This will just produce gibberish for any code unit in the input sequence that doesn't happen to encode an ASCII code point.Deterge
Great job, thanks. I've used it to convert an FPC string (via @str[1]) to a C++ wstring.Interlunar
This code allows, for example, invalid overlong encodings through, and having an intermediate copy means gratuitously copying each code point. I think it's better to use a carefully developed and tested algorithm. I would recommend the implementation in the Boost library (github.com/boostorg/nowide) - it has a freestanding version so you can use it independently of the rest of Boost.Protagonist
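
For reference, here is a minimal sketch of the Boost.Nowide approach suggested in the comment above. It assumes the widen/narrow helpers declared in <boost/nowide/convert.hpp> and is illustrative rather than tested:

#include <string>
#include <boost/nowide/convert.hpp>

int main()
{
    std::string utf8 = u8"سلام";                        // a UTF-8 encoded Persian word
    std::wstring wide = boost::nowide::widen(utf8);     // UTF-16 on Windows, UTF-32 on most Unix systems
    std::string back = boost::nowide::narrow(wide);     // round-trip back to UTF-8
    return 0;
}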
P
54

This is how you do it with C++11:

std::string str = "your string in utf8";
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
std::u16string u16str = converter.from_bytes(str);

And these are the headers you need:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

A more complete example is available here: http://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
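
Putting the snippet and the headers together, a minimal compilable sketch could look like the following (untested here; note that std::wstring_convert and <codecvt> were deprecated in C++17, as mentioned in the comments below):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::string str = u8"your string in utf8";
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    std::u16string u16str = converter.from_bytes(str);
    std::cout << "converted to " << u16str.size() << " UTF-16 code units\n";
    return 0;
}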

Pirtle answered 14/7, 2016 at 20:7 Comment(4)
Great answer, thanks! ...but do follow the example at cppreference.com. wchar_t is not a 16-bit type on operating systems other than Windows. You need to use char16_t instead.Drais
@CrisLuengo thanks! 👍 I updated the answer to use char16_t instead.Pirtle
Not working with g++ 6.2 or clang++ 3.8 on lubuntu 16.04Stoicism
Unfortunately, this was deprecated in C++17. mariusbancila.ro/blog/2018/07/05/…Lotti
H
2

There are some relevant Q&A here and here which are worth a read.

Basically you need to convert the strings to a common format -- my preference is always to convert to UTF-8, but your mileage may vary.

Lots of software has been written for doing the conversion -- the conversion is straightforward and can be written in a few hours -- however, why not pick up something already done, such as UTF-8 CPP?
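
For illustration, a minimal sketch using UTF-8 CPP (utfcpp) might look like this -- it assumes the iterator-based utf8::utf8to16 overload from the library's utf8.h header and is untested here:

#include <iterator>
#include <string>
#include "utf8.h" // UTF-8 CPP (utfcpp)

std::u16string to_utf16(const std::string& utf8)
{
    std::u16string utf16;
    // utf8to16 decodes the UTF-8 range and appends UTF-16 code units to the output iterator.
    utf8::utf8to16(utf8.begin(), utf8.end(), std::back_inserter(utf16));
    return utf16;
}

On Windows, where wchar_t is 16 bits, the same call should also work when appending into a std::wstring instead.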

Herv answered 22/8, 2011 at 21:56 Comment(1)
If you're Windows only: msdn.microsoft.com/en-us/library/dd319072(v=VS.85).aspx. Otherwise, use a portable library.Guttle
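
For the Windows-only route mentioned in the comment above, a rough, untested sketch using the Win32 MultiByteToWideChar API could look like:

#include <string>
#include <windows.h>

std::wstring utf8_to_wstring(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();
    // First call: ask how many wchar_t units the converted string needs.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring utf16(len, L'\0');
    // Second call: perform the actual conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &utf16[0], len);
    return utf16;
}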
A
2

To convert between the two types, you should use std::codecvt_utf8_utf16<wchar_t>.
Note the string prefixes I use to define UTF-16 (L) and UTF-8 (u8).

#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

int main()
{
    std::string original8 = u8"הלו";
    std::wstring original16 = L"הלו";

    // C++11 format converter
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;

    // convert from UTF-16 to UTF-8 and std::string
    std::string utf8NativeString = convert.to_bytes(original16);

    // convert from UTF-8 to UTF-16 and std::wstring
    std::wstring utf16NativeString = convert.from_bytes(original8);

    assert(utf8NativeString == original8);
    assert(utf16NativeString == original16);

    return 0;
}
Acus answered 2/1, 2020 at 9:47 Comment(0)
H
0

Microsoft has developed a beautiful library for such conversions as part of their Casablanca project, also known as the C++ REST SDK (cpprestsdk). The conversion functions live under the utility::conversions namespace.

A simple usage, after using namespace utility::conversions, would look something like this:

utf8_to_utf16("sample_string");
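
As a rough, untested sketch (assuming the conversion helpers are declared in <cpprest/asyncrt_utils.h>, and leaving the result type as auto since the exact typedef may vary between SDK versions):

#include <string>
#include <cpprest/asyncrt_utils.h> // C++ REST SDK ("Casablanca")

int main()
{
    std::string utf8 = "sample_string";
    // Converts the UTF-8 encoded bytes to a UTF-16 encoded string.
    auto utf16 = utility::conversions::utf8_to_utf16(utf8);
    return 0;
}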
Hearty answered 8/7, 2020 at 23:6 Comment(0)
I
0

With C++/WinRT you can easily convert a UTF-8 std::string to an hstring (wide characters) using winrt::to_hstring().
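
A minimal, untested sketch (assuming the std::string_view overload of winrt::to_hstring, which interprets the input as UTF-8, and the <winrt/base.h> header):

#include <string>
#include <winrt/base.h> // C++/WinRT

int main()
{
    std::string utf8 = "sample_string";
    // to_hstring treats the input as UTF-8 and returns a UTF-16 hstring.
    winrt::hstring wide = winrt::to_hstring(utf8);
    // Copy into a std::wstring if one is required.
    std::wstring ws(wide.c_str(), wide.size());
    return 0;
}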

Ichang answered 19/5, 2024 at 17:54 Comment(0)
S
-1

This page also seems useful: http://www.codeproject.com/KB/string/UtfConverter.aspx

In the comment section of that page, there are also some interesting suggestions for this task like:

// Get an ASCII std::string from anywhere
std::string sLogLevelA = "Hello ASCII-world!";

std::wstringstream ws;
ws << sLogLevelA.c_str();
std::wstring sLogLevel = ws.str();

Or

// To std::string:
str.assign(ws.begin(), ws.end());
// To std::wstring
ws.assign(str.begin(), str.end());

Though I'm not sure about the validity of these approaches...

Seminary answered 23/8, 2011 at 8:14 Comment(1)
assign() is definitely not the way to convert UTF-8 <-> UTF-16. Don't try this at home.Epigastrium
