How to convert wstring into string?

Asked 26/1, 2011 at 11:58 Answered 15/4 at 22:26

277

The question is how to convert wstring to string?

I have the following example:

#include <string>
#include <iostream>

int main()
{
    std::wstring ws = L"Hello";
    std::string s( ws.begin(), ws.end() );

  //std::cout <<"std::string =     "<<s<<std::endl;
    std::wcout<<"std::wstring =    "<<ws<<std::endl;
    std::cout <<"std::string =     "<<s<<std::endl;
}

With the commented-out line enabled, the output is:

std::string =     Hello
std::wstring =    Hello
std::string =     Hello

but with it left commented out, the output is only:

std::wstring =    Hello

Is anything wrong in the example? Can I do the conversion like above?

EDIT

The new example (taking some of the answers into account) is:

#include <string>
#include <iostream>
#include <sstream>
#include <locale>

int main()
{
    setlocale(LC_CTYPE, "");

    const std::wstring ws = L"Hello";
    const std::string s( ws.begin(), ws.end() );

    std::cout<<"std::string =     "<<s<<std::endl;
    std::wcout<<"std::wstring =    "<<ws<<std::endl;

    std::stringstream ss;
    ss << ws.c_str();
    std::cout<<"std::stringstream =     "<<ss.str()<<std::endl;
}

The output is:

std::string =     Hello
std::wstring =    Hello
std::stringstream =     0x860283c

Therefore, std::stringstream cannot be used to convert a wstring into a string.

Redeemable answered 26/1, 2011 at 11:58 Comment(21)

I get two lines of output with the first cout line commented out. It seems that something is askew with your environment. What OS and compiler are you on? – Guilt 26/1, 2011 at 12:4

How can you ask this question without specifying also the encodings? – Crisscross 26/1, 2011 at 12:15

consider not using std::string at all. std::wstring has tons of advantages; is it really necessary to demote to std::string? – Chromogen 26/1, 2011 at 12:27

@Marcelo fedora 9 (yeah, it is ancient, but I have no choice) – Archivolt 26/1, 2011 at 12:51

@David I am a complete noob regarding locales. Mind adding an answer showing how it should be done? – Archivolt 26/1, 2011 at 12:53

@VJo It's a mess in C++ because there is no proper portable Unicode support. It's not even properly there in C++0x. But the main thing you need to get to grips with is the encoding used by your strings. The wstring could be UTF-32 or UTF-16 maybe, or maybe UCS-2, I don't know. The string is most likely UTF-8 or one of the ISO 8 bit encodings. But only you can know the answers to these questions. – Crisscross 26/1, 2011 at 12:58

@tenfour: Why use std::wstring at all? #1050447 – Botulinus 26/1, 2011 at 13:14

@Botulinus If you have data that is already encoded with UTF-16, whether or not UTF-16 is considered harmful is somewhat moot. And for what it's worth, I don't think any transformation form is harmful; what is harmful is people thinking they understand Unicode when in fact they don't. – Crisscross 26/1, 2011 at 13:17

Does it have to be a cross-platform solution? – Confident 26/1, 2011 at 13:18

@sad_man If you can make one that is better. If not, I would prefer a linux solution. – Archivolt 26/1, 2011 at 13:22

Oops, I had one for Windows, not for Linux. Ok good luck. – Confident 26/1, 2011 at 13:25

@dalle: what has wstring to do with UTF-16? – Rahr 26/1, 2011 at 13:40

@Philipp: Absolutely nothing at all, although a lot of people incorrectly think that it has something to do with UTF-16. According to the C++ standard std::wstring cannot be UTF-16 encoded. – Botulinus 26/1, 2011 at 13:49

@Botulinus The C++ standard doesn't mention UTF in any way (UTF-8 or UTF-16). Got a link where it says why UTF-16 can't be encoded with wstring? – Archivolt 26/1, 2011 at 14:8

@VJo: C++ Standard 3.9.1 paragraph 5 states "Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales." – Botulinus 26/1, 2011 at 15:32

@Botulinus does that mean wchar_t cannot be used with variable length encodings? – Crisscross 26/1, 2011 at 18:56

@David Heffernan: That is my interpretation. Although there are C++ implementations which do so anyway. – Botulinus 26/1, 2011 at 20:12

@Botulinus So if you want to write portable C++ and use UTF-16 what do you do. As far as I can see C++ is still in the dark ages with regards Unicode and not even C++0x will bring complete support for the standard Unicode locales. It's basically utterly rubbish! The committee, fine outstanding individuals that they are, really should get on top of this issue. – Crisscross 26/1, 2011 at 20:27

Of course std::(w)string can contain UTF-8 or UTF-16, but other parts of the C++ standard library can't handle variable-length encodings, most noticeably several locale facets. And of course if you do string manipulations with UTF-8/16 strings (e.g. substr, resize, ...) you will have to check manually that all code points are still intact before outputting them. – Hifi 26/1, 2011 at 21:6

@Chromogen opposite. utf8everywhere.org – Guffaw 25/2, 2015 at 7:2

https://github.com/Shilyx/charconv I think this lib is enough for win32 platform – Parvenu 25/9, 2018 at 2:6

Here is a worked-out solution based on the other suggestions:

#include <string>
#include <iostream>
#include <clocale>
#include <locale>
#include <vector>

int main() {
  std::setlocale(LC_ALL, "");
  const std::wstring ws = L"ħëłlö";
  const std::locale locale("");
  typedef std::codecvt<wchar_t, char, std::mbstate_t> converter_type;
  const converter_type& converter = std::use_facet<converter_type>(locale);
  std::vector<char> to(ws.length() * converter.max_length());
  std::mbstate_t state = std::mbstate_t();  // zero-initialize: an uninitialized conversion state is undefined behavior
  const wchar_t* from_next;
  char* to_next;
  const converter_type::result result = converter.out(state, ws.data(), ws.data() + ws.length(), from_next, &to[0], &to[0] + to.size(), to_next);
  if (result == converter_type::ok or result == converter_type::noconv) {
    const std::string s(&to[0], to_next);
    std::cout <<"std::string =     "<<s<<std::endl;
  }
}

This will usually work for Linux, but will create problems on Windows.

Rahr answered 26/1, 2011 at 14:6 Comment(11)

@Phillip: which part of the code depend on the c-locale ? is the std::setlocale(LC_ALL, ""); really needed ? – Hifi 26/1, 2011 at 14:44

@smerlin: I'm using vector now. (I was too lazy to look whether vector is guaranteed to be contiguous even in C++03, but it is.) setlocale is only needed if you are using wcout because that uses the stdio locale. – Rahr 26/1, 2011 at 14:54

Using std::wcout.imbue(locale) should do the job as well, and it has the benefit that it does not change any global state. – Hifi 26/1, 2011 at 15:22

The std::wstring_convert from C++11 wraps up a lot of this noise. – Finnegan 27/9, 2011 at 19:34

@Philipp, what do you mean "will create problems on Windows"? What kind of problems? – Varicose 23/11, 2011 at 21:45

The above code (as copied) gives me a *** glibc detected *** test: malloc(): smallbin double linked list corrupted: 0x000000000180ea30 *** on linux 64-bit (gcc 4.7.3). Anybody else experiencing this? – Halation 10/11, 2013 at 12:22

The code above doesn't work on Linux (KUBUNTU) GCC 4.7 – Tenerife 27/7, 2014 at 12:24

@Halation maybe you can run it in valgrind and report a bug to the maintainers of the code part where the first violation happens ? – Guffaw 25/2, 2015 at 7:6

I am getting this error on GCC 4.8: "Invalid arguments Candidates are: const #0 & use_facet(const std::locale &) ". Can I get some help please ? – Kalikow 10/6, 2016 at 7:49

u got a typo "or" > if (result == converter_type::ok or result == converter_type::noconv) – Annoyance 20/1, 2017 at 9:43

g++ 7.2.0 on msys2 (mingw64): This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information. terminate called after throwing an instance of 'std::runtime_error' what(): locale::facet::_S_create_c_locale name not valid – Loreeloreen 26/10, 2017 at 19:2

390

As Cubbi pointed out in one of the comments, std::wstring_convert (C++11) provides a neat simple solution (you need to #include <locale> and <codecvt>):

std::wstring string_to_convert;

//setup converter
using convert_type = std::codecvt_utf8<wchar_t>;
std::wstring_convert<convert_type, wchar_t> converter;

//use converter (.to_bytes: wstr->str, .from_bytes: str->wstr)
std::string converted_str = converter.to_bytes( string_to_convert );

I was using a combination of wcstombs and tedious allocation/deallocation of memory before I came across this.

http://en.cppreference.com/w/cpp/locale/wstring_convert
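
For comparison, the wcstombs-style approach mentioned above might look roughly like the sketch below. It uses std::wcsrtombs (the restartable variant) and a std::vector for the buffer, so there is no manual deallocation. The function name narrow_with_locale is made up for illustration, and the result depends on the current C locale, which the caller is assumed to have set beforehand (for example with std::setlocale(LC_ALL, "")).

```cpp
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (name is an assumption, not from the answer):
// converts a wide string to the multibyte encoding of the current C locale.
std::string narrow_with_locale(const std::wstring& ws) {
    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    const wchar_t* src = ws.c_str();

    // First pass: a null destination only computes the required byte count
    // (not counting the terminating null byte).
    const std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("wide string not representable in the current locale");
    }

    // Second pass: convert into a buffer with room for the terminator.
    std::vector<char> buf(len + 1);
    state = std::mbstate_t();
    src = ws.c_str();
    std::wcsrtombs(buf.data(), &src, buf.size(), &state);

    return std::string(buf.data(), len);
}
```

Note that std::wcsrtombs stops at the first L'\0', so a wstring with embedded null characters would be cut short by this sketch.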

update(2013.11.28)

One-liners can be written like this (Thank you Guss for your comment):

std::wstring str = std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes("some string");

Wrapper functions can be written like this (Thank you ArmanSchwarz for your comment):

std::wstring s2ws(const std::string& str)
{
    using convert_typeX = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_typeX, wchar_t> converterX;

    return converterX.from_bytes(str);
}

std::string ws2s(const std::wstring& wstr)
{
    using convert_typeX = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_typeX, wchar_t> converterX;

    return converterX.to_bytes(wstr);
}

Note: there's some controversy on whether string/wstring should be passed in to functions by reference or by value (due to C++11 and compiler updates). I'll leave the decision to the person implementing, but it's worth knowing.

Note: I'm using std::codecvt_utf8 in the above code, but if you're not using UTF-8 you'll need to change that to the appropriate encoding you're using:

http://en.cppreference.com/w/cpp/header/codecvt
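
Since several comments point out that <codecvt> and std::wstring_convert are deprecated as of C++17, here is a minimal hand-written sketch that produces UTF-8 without them. The name wide_to_utf8 is hypothetical. It assumes the wstring holds UTF-16 code units or UTF-32 code points (which one depends on the platform's wchar_t), and it throws on unpaired surrogates instead of trying to repair them.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the standard library): encodes a wide
// string as UTF-8. Assumption: the input is UTF-16 or UTF-32.
std::string wide_to_utf8(const std::wstring& ws) {
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(ws[i]);

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 >= ws.size()) {
                throw std::invalid_argument("unpaired high surrogate");
            }
            const std::uint32_t lo = static_cast<std::uint32_t>(ws[i + 1]);
            if (lo < 0xDC00 || lo > 0xDFFF) {
                throw std::invalid_argument("unpaired high surrogate");
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            ++i;  // the low surrogate has been consumed
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        } else if (cp > 0x10FFFF) {
            throw std::invalid_argument("code point out of range");
        }

        // Emit 1 to 4 bytes depending on the size of the code point.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

This only covers the wstring to UTF-8 direction. For any other target encoding, a platform API or a dedicated library is probably the better choice.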

Trite answered 22/8, 2013 at 7:57 Comment(16)

Please +1: this is the official C++ standard way to do string conversion. You can also use from_bytes to convert the other way. Because I personally like one-liners, here is my version: std::wstring str = std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes("some string"); – Babblement 11/11, 2013 at 12:59

2 weeks I spent using giant dodgy templated monstrosities before I came across this. Thank you. Please consider wrapping in a simple std::string ws2s(std::wstring const&) function, might get more up-votes that way. – Airing 13/11, 2013 at 6:5

Looks like en.cppreference.com/w/cpp/header/codecvt isn't available as of g++ 4.8.2. The two s2ws and ws2s methods do not currently work under linux – Countrybred 10/9, 2014 at 11:34

Works like a charm, but only on modern compilers, due to the missing <codecvt> header in older versions – Dessiedessma 13/5, 2016 at 12:28

This is the way to do it. – Gynous 9/3, 2017 at 0:53

Be aware: In VS15 .from_bytes("äüöß") causes an unhandled c++ exception in xlocbuf, line 426, while .from_bytes(u8"äüöß") works. So please add to the answer to use unicode std::string (using u8) – Toro 28/4, 2017 at 11:53

@RoiDanton: u8".." creates a narrow string (const char[]) that contains bytes that represent the text encoded using utf-8 encoding. ".." is also const char[] but its encoding is probably whatever character encoding you used for the source file (e.g., ANSI code page such as cp1252). You can't decode an arbitrary sequence of bytes using utf-8 encoding -- it may lead to the error that you've encountered (e.g., "äüöß".encode('cp1252').decode('utf-8') raises UnicodeDecodeError in Python). – Honna 29/4, 2017 at 17:49

@J.F.Sebastian The source file is encoded in UTF8 (w/o bom). Furthermore with "\u00E4\u00F6\u00FC\u00DF\u20AC\u0040" the same crash occurs, while - if prepended with u8 - converting the sequence works. – Toro 2/5, 2017 at 11:16

@RoiDanton: it is even worse then: you use the encoding that is incompatible with your compiler configuration otherwise the result would be the same as u8"" (but less portable). Note: you don't need the source code encoding to be utf-8 in order to use u8"" -- it is responsibility of the compiler to convert from the source code encoding to utf-8 i.e., the encoding used for the strings in the executable may be different from the source code encoding (u8"" = source_bytes.decode(source_code_encoding).encode('utf-8'); "" = source_bytes.decode(source_code_encoding).encode(exec_charset)). – Honna 2/5, 2017 at 11:42

@J.F.Sebastian Thanks! Indeed, Visual C++ compiler uses the system's codepage as execution character set. I've changed its character set and it works: https://mcmap.net/q/56218/-set-execution-character-set-for-visual-c-compiler/4566599 – Toro 2/5, 2017 at 14:12

Unbelievable. I can't memorize this snippet and come back here on a regular basis. Darn string handling. Thanks for providing this snippet. – Evermore 14/12, 2017 at 7:58

It looks like this is deprecated (https://mcmap.net/q/56219/-deprecated-header-lt-codecvt-gt-replacement). My compiler throws errors when I try to run this code – Benedictine 14/2, 2018 at 19:53

Deprecated on C++17... en.cppreference.com/w/cpp/locale/wstring_convert en.cppreference.com/w/cpp/locale/codecvt_utf8 – Evaginate 26/9, 2018 at 17:0

To anybody worrying about C++17 and further compatibility (due to deprecation) see: https://mcmap.net/q/16013/-c-convert-string-or-char-to-wstring-or-wchar_t – Gibeon 20/3, 2019 at 14:32

They're actually removing this method in C++26. 2 steps forward, 2 steps back? – Outright 15/4 at 21:55

Apparently <filesystem> can be used for this type of string conversion too? – Outright 16/4 at 2:26

177

An older solution from: http://forums.devshed.com/c-programming-42/wstring-to-string-444006.html

std::wstring wide( L"Wide" ); 
std::string str( wide.begin(), wide.end() );

// Will print no problemo!
std::cout << str << std::endl;

Update (2021): However, at least on more recent versions of MSVC, this may generate a wchar_t to char truncation warning. The warning can be quieted by using std::transform instead with explicit conversion in the transformation function, e.g.:

std::wstring wide( L"Wide" );

std::string str;
std::transform(wide.begin(), wide.end(), std::back_inserter(str), [] (wchar_t c) {
    return (char)c;
});

Or if you prefer to preallocate and not use back_inserter:

std::string str(wide.length(), 0);
std::transform(wide.begin(), wide.end(), str.begin(), [] (wchar_t c) {
    return (char)c;
});

See example on various compilers here.

Beware that there is no character set conversion going on here at all. What this does is simply assign each iterated wchar_t to a char, which is a truncating conversion. It uses the std::string constructor:

template< class InputIt >
basic_string( InputIt first, InputIt last,
              const Allocator& alloc = Allocator() );

As stated in comments:

values 0-127 are identical in virtually every encoding, so truncating values that are all less than 128 results in the same text. Put in a Chinese character and you'll see the failure.

the values 128-255 of Windows codepage 1252 (the Windows English default) and the values 128-255 of Unicode are mostly the same, so if that's the codepage you're using, most of those characters should be truncated to the correct values. (I totally expected á and õ to work, I know our code at work relies on this for é, which I will soon fix)

And note that code points in the range 0x80 - 0x9F in Win1252 will not work. This includes €, œ, ž, Ÿ, ...
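
One way to guard against silent truncation is to check each character before narrowing it, as suggested in the comments. The sketch below is only an illustration: the name narrow_ascii is hypothetical, and it takes the conservative route of accepting only values 0-127, which are the ones that stay the same text after truncation.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: narrows a wide string only if every character is
// plain ASCII (0-127); otherwise throws instead of truncating silently.
std::string narrow_ascii(const std::wstring& ws) {
    std::string out;
    out.reserve(ws.size());
    for (wchar_t wc : ws) {
        // Casting to unsigned long also catches negative values on
        // platforms where wchar_t is a signed type.
        if (static_cast<unsigned long>(wc) > 0x7F) {
            throw std::range_error("non-ASCII character cannot be narrowed losslessly");
        }
        out.push_back(static_cast<char>(wc));
    }
    return out;
}
```

With this, narrow_ascii(L"Wide") yields "Wide", while a string containing something like € makes the function throw rather than produce a wrong byte.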

Minor answered 23/8, 2012 at 18:18 Comment(16)

Bizarrely, this works on Visual Studio 10. What is going on? This should cause a truncating assignment from wchar_t to char for all elements of the original string. – Congressional 4/1, 2013 at 17:41

@PedroLamarão: Why? std::wstring is just the specialization of the same string class template for type 'wchar_t' in the STL. – Minor 16/1, 2013 at 21:8

The second line above creates an std::string (presumably) from a Range of Iterators, whose value type must then be char. But [wide.begin(), wide.end()) is a Range of Iterators whose value type is wchar_t, whose size is greater than the size of char. I see now that even your source states this is not portable. Perhaps it's Visual Studio specific. – Congressional 5/2, 2013 at 15:26

Just tried it here codepad.org/zUh426eh and it worked. I believe they use some flavor of GCC. The said string constructor that takes ranges must have a std template specialization implemented for wstring to string and back as it is cross compatible. – Minor 11/2, 2013 at 3:14

...when it goes to any non-latin characters. – Lunalunacy 31/5, 2013 at 21:14

@PedroLamarão: values 0-127 are identical in virtually every encoding, so truncating values that are all less than 127 results in the same text. Put in a chinese character and you'll see the failure. – Radiobiology 4/9, 2013 at 20:20

@MooingDuck I thought I'd seen this work for á or õ but I have tried again just now and it doesn't. Your reasoning must be correct. – Congressional 5/9, 2013 at 13:16

@PedroLamarão: the values 128-255 of windows codepage 1252 (the Windows English default) and the values 128-255 of unicode are mostly the same, so if that's the codepage you're using most of those characters should be truncated to the correct values. (I totally expected á and õ to work, I know our code at work relies on this for é, which I will soon fix) – Radiobiology 5/9, 2013 at 16:30

No problems when using g++ 4.8.1 on Linux. Also works on VS2005. – Kila 4/10, 2015 at 11:38

Didn't work on Solaris 10 either... error is "Could not find a match for std::string::basic_string(wchar_t*, wchar_t*)" – Ario 23/5, 2017 at 5:52

This used to work for me until I upgraded to MSVC 2019 and v142 toolset - now it craps out with a warning (which I always treat as errors): warning C4244: 'argument': conversion from 'const wchar_t' to 'const _Elem', possible loss of data – Cowart 14/4, 2019 at 13:26

@Cowart I noticed the same thing. Although the warning is unpleasant, that truncation is desired. If you're worried about the uncertainty, you could always extract this conversion to a method where you disable the warning. Of course you'd also want to do a check before converting to make sure the wstring values are within a valid 1 byte range for the returning string. If they are outside that range, throw an error or handle the case however you see fit. – Amphibrach 17/7, 2019 at 18:4

Updated with a slightly more verbose but warning-free approach. Should work on Solaris too. – Shurlock 6/5, 2021 at 16:15

The use of std::transform there with the lambda object incurs a 2x perf cost – Outright 15/4 at 22:35

The 2 MSVC++ warnings are due to the truncation & apparently a signed / unsigned mismatch doing the assignment of a wchar_t to a char – Outright 15/4 at 22:36

But you can just suppress those warnings instead: #pragma warning( suppress : 4244 4365 ). You should put the #pragma just above the std::string s(wstring.begin(), wstring.end()) line of code (suppress only applies to the next line) – Outright 15/4 at 22:37
