How to write a non-English string to a file and read from that file with C++?
I want to write a std::wstring to a file and read that content back as a std::wstring. This works as expected when the string is L"<any English letters>". The problem occurs with non-English characters such as Bengali, Kannada, or Japanese. I have tried various options:

  1. Converting the std::wstring to std::string and writing that to the file, then reading it back as std::string and converting it to std::wstring
    • Writing works (I can see the text in an editor), but reading returns the wrong characters
  2. Writing the std::wstring to a std::wofstream; this also does not work for native-language characters such as std::wstring data = L"হ্যালো ওয়ার্ল্ড";

The platforms are macOS and Linux; the language is C++.

Code:

bool
write_file(
    const char*         path,
    const std::wstring  data
) {
    bool status = false;
    try {
        std::wofstream file(path, std::ios::out|std::ios::trunc|std::ios::binary);
        if (file.is_open()) {
            //std::string data_str = convert_wstring_to_string(data);
            file.write(data.c_str(), (std::streamsize)data.size());
            file.close();
            status = true;
        }
    } catch (...) {
        std::cout<<"exception !"<<std::endl;
    }
    return status;
}


// Read Method

std::wstring
read_file(
    const char*  filename
) {
    std::wifstream fhandle(filename, std::ios::in | std::ios::binary);
    if (fhandle) {
        std::wstring contents;
        fhandle.seekg(0, std::ios::end);
        contents.resize((int)fhandle.tellg());
        fhandle.seekg(0, std::ios::beg);
        fhandle.read(&contents[0], contents.size());
        fhandle.close();
        return(contents);
    }
    else {
        return L"";
    }
}

// Main

int main()
{
  const char* file_path_1 = "./file_content_1.txt";
  const char* file_path_2 = "./file_content_2.txt";

  //std::wstring data = L"Text message to write onto the file\n";  // This is happening as expected
  std::wstring data = L"হ্যালো ওয়ার্ল্ড";
// Not happening as expected.

  // Lets write some data
  write_file(file_path_1, data);
 // Lets read the file
 std::wstring out = read_file(file_path_1);

 std::wcout<<L"File Content: "<<out<<std::endl;
 // Let write that same data onto the different file
 write_file(file_path_2, out);
 return 0;
}
Sankaran answered 2/8, 2013 at 8:21 Comment(4)
Use std::wifstream and std::wofstream (or std::wfstream), then you can use std::wstring directly.Biafra
@JoachimPileborg, I wrote the sample code above, but it does not work as expected when the string contains any non-English character, e.g. std::wstring data = L"হ্যালো ওয়ার্ল্ড";Sankaran
Unrelated, but why do you open the file in binary mode if you're only reading/writing text? Also, when writing you don't have to flush the file as that will be done by closing it.Biafra
@JoachimPileborg He does (but that may be the result of an edit after your comments). But in most implementations (and this is what I would expect), the locales "C" (the default) and "Posix" will only map codes corresponding to ASCII characters.Micronutrient
M
3

How a wchar_t is output depends on the locale. The default locale ("C") generally doesn't accept anything but ASCII (Unicode code points 0x20...0x7E, plus a few control characters.)

Any time a program handles text, the very first statement in main should be:

std::locale::global( std::locale( "" ) );

If the program uses any of the standard stream objects, the code should also imbue them with the global locale, before any input or output.
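
A minimal sketch of that startup sequence, assuming a hypothetical helper named use_user_locale (not part of the standard library). It keeps the default "C" locale when the environment names a locale the C++ library cannot construct, since std::locale("") throws std::runtime_error in that case:

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>

// Hypothetical helper: install the user's preferred locale as the global
// locale and imbue the standard wide streams with it.
// Returns false (leaving the "C" locale in place) if the locale named by
// the environment cannot be constructed.
bool use_user_locale() {
    try {
        std::locale::global(std::locale(""));
    } catch (const std::runtime_error&) {
        return false;
    }
    // Streams that already exist keep their old locale until imbued.
    std::wcout.imbue(std::locale());
    std::wcin.imbue(std::locale());
    return true;
}
```

Calling use_user_locale() as the first statement in main, before any output, follows the advice above. A false return means the wide streams are still limited to what the "C" locale can encode.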

Micronutrient answered 2/8, 2013 at 8:33 Comment(7)
When I added std::locale::global( std::locale( "" ) ); in main, I got an exception: libc++abi.dylib: terminate called throwing an exception Abort trap: 6Sankaran
@AbhrajyotiKirtania That's strange, because most of my C++ programs start this way (and it is required to "work" by the C++ standard, although the implementation gets to define what it means by "work"). What's your environment? (And if it's Unix based, what are $LANG and the $LC_... set to?)Micronutrient
I am trying this on a Unix-based system. How does the $LANG setting make a difference?Sankaran
$LANG determines the locale used by std::locale( "" ). Under Unix, passing an empty string as the name of the locale causes (or should cause) the implementation to construct a locale based on $LANG and the $LC_... environment variables. std::locale::global( std::locale( "" ) ); is the C++ equivalent of setlocale( LC_ALL, "" ), as defined by Posix (in pubs.opengroup.org/onlinepubs/9699919799/functions/…). If it doesn't work, and your $LANG and $LC_... are set reasonably, then this is a serious bug in the g++ libraries.Micronutrient
I would really recommend against doing internationalization by relying on the system's locale.Chiclayo
@JamesKanze "That's strange, because most of my C++ programs start this way" libstdc++ on OS X doesn't implement proper locale support so it won't work except with the "C" locale. It will consider the normal system locale names to be invalid. libc++ has proper locale support though.Chiclayo
I should clarify; using the system locale may be fine for things like getting default punctuation, formats, etc., but encodings should never depend on locales.Chiclayo
P
0

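To read and write Unicode files (assuming you want to write Unicode characters) you can try fopen_s with Microsoft's ccs=UNICODE mode flag:

FILE *file = NULL;

// fopen_s returns an errno_t that is zero on success, not a pointer.
if (fopen_s(&file, file_path, "w,ccs=UNICODE") == 0)
{
    fputws(your_wstring().c_str(), file);
    fclose(file);
}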
Patnode answered 2/8, 2013 at 8:29 Comment(4)
There is no such function in C++. (There is in C11, but C++11 is based on C99.)Micronutrient
@mag_zbc, I think fopen_s is not a standard C++ function. Could you please check?Sankaran
@AbhrajyotiKirtania It's part of C11 (which means that a lot of C++ compilers probably do support it, if they also support C11).Micronutrient
Yes, that's true, it's not C++ standard. It works with Visual Studio though.Patnode
N
0

Later edit: this is for Windows (since no tag was present at the time of the answer)

You need to set the stream to a locale that supports those characters. Try something like this (for UTF-8/UTF-16):

std::wofstream myFile("out.txt"); // writing to this file 
myFile.imbue(std::locale(myFile.getloc(), new std::codecvt_utf8_utf16<wchar_t>));

And when you read from that file you have to do the same thing:

std::wifstream myFile2("out.txt"); // reading from this file
myFile2.imbue(std::locale(myFile2.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
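
On Linux and macOS, where wchar_t is 32 bits wide, the comments below settle on std::codecvt_utf8<wchar_t> as the facet to use. A minimal sketch under that assumption; write_utf8 and read_utf8 are hypothetical helper names, and <codecvt> has been deprecated since C++17, so newer compilers may warn:

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Write a wide string to `path` as UTF-8, independent of the global locale.
bool write_utf8(const char* path, const std::wstring& data) {
    std::wofstream file;
    // Imbue before opening so the facet is in place for the first byte.
    file.imbue(std::locale(file.getloc(), new std::codecvt_utf8<wchar_t>));
    file.open(path, std::ios::out | std::ios::trunc | std::ios::binary);
    if (!file) return false;
    file << data;
    file.flush();  // force the conversion so encoding errors show up here
    return file.good();
}

// Read a UTF-8 file back into a wide string character by character,
// so no byte count is ever used as a character count.
std::wstring read_utf8(const char* path) {
    std::wifstream file;
    file.imbue(std::locale(file.getloc(), new std::codecvt_utf8<wchar_t>));
    file.open(path, std::ios::in | std::ios::binary);
    if (!file) return std::wstring();
    return std::wstring(std::istreambuf_iterator<wchar_t>(file),
                        std::istreambuf_iterator<wchar_t>());
}
```

The locale object takes ownership of the facet created with new, so no explicit delete is needed.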
Nanceenancey answered 2/8, 2013 at 8:34 Comment(13)
If he wants UTF-8. If he's on Windows, he probably wants UTF-16LE. And in any environment, he wants the user to decide. (But this gets tricky when reading, since files from different sources may be encoded differently.)Micronutrient
Also, std::wofstream will start with the default global locale. If he's set this correctly at the start of main, he doesn't have to imbue anything.Micronutrient
Yes, of course. I assumed that his characters are UTF-8, so that's why I used UTF-8. I'll edit my answer :)Nanceenancey
@JamesKanze, yes I know that too (about wofstream), but if he's got a similar setup like mine (Windows w/ English although I'm not English) he may have to do this. But again, you're right.Nanceenancey
Aha. I'm not really familiar with how Windows handles locales (I've only worked with Windows in an English speaking environment); I would expect that even under English Windows, there would be some way of setting the locale via environment variables, which should be picked up with locale( "" ). But given the way most Windows users work, I rather doubt that they'd be using it. (For those unfamiliar with Unix: there are no national versions for Unix. Instead, each user sets environment variables stating what language, etc. he wants to use.)Micronutrient
@JamesKanze and all, I am looking for something that works on Mac/LinuxSankaran
well... there go all my ideas :)Nanceenancey
@IosifM. Your solution should also work under Linux or Mac. Except that since wchar_t is UTF-32 on these platforms, the codecvt you need should be std::codecvt_utf8_utf32<wchar_t>. And... the standard Unicode file on these systems is UTF-8, so that's almost certainly what he wants. (Of course, these codecvt are only guaranteed to be present in C++11. And I have no idea whether current versions of g++ support this part of C++11---historically, g++ has been very behind VC++ in things concerning i18n.)Micronutrient
@JamesKanze the only problem is that std::codecvt_utf8_utf32 doesn't exist afaik. However he should try this code on his machine and see if it works. If it does - yay, if it doesn't - booNanceenancey
@IosifM. You're right about std::codecvt_utf8_utf32. It should be just std::codecvt_utf8. (In your case as well, I think. But on a system where wchar_t is UTF-16, I think both will be the same when instantiated over wchar_t.)Micronutrient
@JamesKanze codecvt_utf8<wchar_t> with 16-bit wchar_t will only support characters in the BMP, not the full Unicode range. To support UTF-16 wchar_t you must use codecvt_utf8_utf16.Chiclayo
@Chiclayo That's not what the standard says. (But as you point out, compliance with this particular section of the standard has been a weak point of many compilers.)Micronutrient
@JamesKanze It does say that; 22.5/4 "For the facet codecvt_utf8 The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem)". UCS2 uses 16-bit code units and does not support characters outside the BMP.Chiclayo
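
The distinction matters for any character outside the BMP: UCS-2 has no way to represent it, while UTF-16 encodes it as a surrogate pair. A minimal sketch of that encoding step, using a hypothetical helper named to_utf16 and assuming the input is a valid Unicode scalar value:

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
std::u16string to_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Inside the BMP: one 16-bit unit, identical in UCS-2 and UTF-16.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: split the remaining 20 bits across a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

For example, U+1F600 becomes the two code units 0xD83D 0xDE00, whereas every Bengali letter in the question fits in a single unit.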
B
0

One possible problem may be when you read the string back, because you set the length of the string to the number of bytes in the file and not the number of characters. This means that you attempt to read past the end of the file, and also that the string will contain trash at the end.

If you're dealing with text files, why not simply use the normal output and input operators << and >> or other textual functions like std::getline?
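
To see how far the byte count and the character count diverge for UTF-8 text, here is a minimal sketch with a hypothetical helper, count_code_points, that assumes its input is well-formed UTF-8. It counts every byte that is not a continuation byte (bit pattern 10xxxxxx):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of Unicode code points in a UTF-8 string.
// Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        // Continuation bytes look like 10xxxxxx; every other byte starts
        // a new code point.
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

Each Bengali letter takes three bytes in UTF-8, so a wide string sized from tellg() on such a file ends up longer than the decoded text it holds.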

Biafra answered 2/8, 2013 at 8:38 Comment(3)
I'd think just the reverse. The number of bytes will never be less than the number of characters, but it could be significantly more.Micronutrient
Re your edit: and input and output into std::wstring, so he doesn't have to worry about the size anywhere.Micronutrient
And with regards to your initial comment: while I doubt that this is the problem here, it is a very valid point. When reading, you must always verify that the read succeeded before using the data. And std::wistream::read is a bit special, since it will set the failbit even when it succeeds in reading some (but not all) characters; the failure condition is !stream && stream.gcount() == 0. If stream.gcount() != 0, you've successfully read that many characters.Micronutrient
C
0

Do not use wstring or wchar_t. On non-Windows platforms wchar_t is pretty much worthless these days.

Instead you should use UTF-8.

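#include <fstream>
#include <iostream>
#include <string>

bool
write_file(
    const char*         path,
    const std::string&  data
) {
    try {
        std::ofstream file(path, std::ios::out | std::ios::trunc | std::ios::binary);
        // Throw on open or write failure instead of failing silently.
        file.exceptions(std::ios::failbit | std::ios::badbit);
        file << data;
        return true;
    } catch (...) {
        std::cout << "exception!\n";
        return false;
    }
}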


// Read Method

std::string
read_file(
    const char*  filename
) {
    std::ifstream fhandle(filename, std::ios::in | std::ios::binary);

    if (fhandle) {
        std::string contents;
        fhandle.seekg(0, std::ios::end);
        contents.resize(fhandle.tellg());
        fhandle.seekg(0, std::ios::beg);
        fhandle.read(&contents[0], contents.size());
        return contents;
    } else {
        return "";
    }
}

int main()
{
  const char* file_path_1 = "./file_content_1.txt";
  const char* file_path_2 = "./file_content_2.txt";

  std::string data = "হ্যালো ওয়ার্ল্ড"; // Linux and OS X compilers use UTF-8 as the default execution encoding.

  write_file(file_path_1, data);
  std::string out = read_file(file_path_1);

  // out is a narrow UTF-8 string, so it goes to std::cout, not std::wcout.
  std::cout << "File Content: " << out << '\n';
  write_file(file_path_2, out);
}
Chiclayo answered 2/8, 2013 at 17:36 Comment(0)
