Read/Write file with unicode file name with plain C++/Boost
C

4

12

I want to read / write a file with a unicode file name using boost filesystem, boost locale on Windows (mingw) (should be platform independent at the end).

This is my code:

#include <boost/locale.hpp>
#define BOOST_NO_CXX11_SCOPED_ENUMS
#include <boost/filesystem.hpp>
#include <boost/filesystem/fstream.hpp>
namespace fs = boost::filesystem;

#include <string>
#include <iostream>

int main() {

  std::locale::global(boost::locale::generator().generate(""));
  fs::path::imbue(std::locale());

  fs::path file("äöü.txt");
  if (!fs::exists(file)) {
    std::cout << "File does not exist" << std::endl;
  }

  fs::ofstream(file, std::ios_base::app) << "Test" << std::endl;
}

The fs::exists call really checks for a file with the name äöü.txt. But the written file has the name Ã¤Ã¶Ã¼.txt.

Reading gives the same problem. Using fs::wofstream doesn't help either, since it only makes the stream's content wide, not the file name.

How can I fix this using C++11 and boost?

Edit: Bug report posted: https://svn.boost.org/trac/boost/ticket/9968

To clarify for the bounty: It is quite simple with Qt, but I would like a cross platform solution using just C++11 and Boost, no Qt and no ICU.

Conveyancing answered 30/4, 2014 at 16:53 Comment(8)
Actually, given Ã¤Ã¶Ã¼.txt, it looks like the literal is already UTF8, except boost::fs::path is treating it as if it were CodePage 1252. Or more likely, boost::fs::path is ignoring the encoding altogether, and simply passing it to the OS, and the OS is assuming it's codepage 1252.Flatfish
Rereading the question, fs::exists is working, so that means that the error must be in boost::fs::ofstream. I would guess it's detecting that you're compiling with GCC and so incorrectly deciding to pass the OS a UTF8 encoded filename. That would be a boost bug. (An answer was deleted, but OP clarified problem is identical for wide string literal)Flatfish
Possibly äöü are not in the source character set; try replacing them with the equivalent hex literals (I'm assuming you mean the versions of these characters that are storable in an 8-bit char).Sotted
But then, why does fs::exists work? It really seems to be a problem in the filesystem streams, so I'm looking for a solution without them, or a fix for them.Conveyancing
I have tested on Ubuntu 12.04 with boost v1.48, the issue is not reproduced. Maybe you can check which boost version you're using and see if it's already fixed or if it's mingw's issue.Downer
I'm using Boost 1.55. How would I see if it is a mingw issue?Conveyancing
In what encoding is your source file, and what is your system encoding? If you wrote a program as simple as int main() { std::cout << "äöü"; return 0; } what would the output be? (includes omitted for clarity ...)Grecism
The encoding of the source file may be anything, but it is most likely to be UTF8. But I don't see why the content matters for the file name. The system encoding is whatever the user uses, since I need it to be platform independent. Currently I'm testing on Windows, so cp 1252.Conveyancing
P
10

This can be complicated, for two reasons:

  1. There's a non-ASCII string in your C++ source file. How this literal gets converted to the binary representation of a const char * would depend on compiler settings and/or OS codepage settings.

  2. Windows only works with Unicode filenames through the UTF-16 encoding, while Unix uses UTF-8 for Unicode filenames.
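Both points come down to which code units actually end up in the string. As a small diagnostic sketch (the helper name code_units_hex is made up for this illustration and is not part of Boost or the standard library), the following prints the code units of a narrow or wide string in hex, so you can check what the compiler really stored for a literal:

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <type_traits>

// Render every code unit of a string as hex, separated by spaces.
// Hypothetical helper for inspecting how a literal was encoded.
template <typename CharT>
std::string code_units_hex(const std::basic_string<CharT>& s) {
    typedef typename std::make_unsigned<CharT>::type Unsigned;
    std::ostringstream out;
    out << std::hex;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0) out << ' ';
        // Go through the unsigned type so bytes >= 0x80 are not sign-extended.
        out << static_cast<unsigned long>(static_cast<Unsigned>(s[i]));
    }
    return out.str();
}
```

If the compiler stored "äöü.txt" as UTF-8, code_units_hex(std::string("äöü.txt")) should start with c3 a4 c3 b6 c3 bc; with a CP1252 execution character set it would start with e4 f6 fc instead. The wide literal L"\u00E4\u00F6\u00FC" gives e4 f6 fc as wchar_t units.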

Constructing the path object

To get this working on Windows, you can try to change your literal to wide characters (UTF-16):

const wchar_t *name = L"\u00E4\u00F6\u00FC.txt";
fs::path file(name);

To get a full cross-platform solution, you'll have to start with either a UTF-8 or a UTF-16 string, then make sure it gets properly converted to the path::string_type class.
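As a rough sketch of that conversion step, the function below decodes a UTF-8 string into wchar_t units: UTF-16 where wchar_t is 16 bits wide (Windows) and UTF-32 where it is 32 bits wide (most Unix systems). The name utf8_to_wide is hypothetical, and the decoder is deliberately minimal: it rejects malformed lead and continuation bytes but does not check for overlong encodings or encoded surrogates. If Boost.Locale is available, boost::locale::conv::utf_to_utf<wchar_t> should do the same job.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode UTF-8 into wchar_t units (UTF-16 or UTF-32 depending on sizeof(wchar_t)).
// Hypothetical helper, not a Boost or standard library function.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow
        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + extra >= utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c >> 6) != 0x02)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            cp -= 0x10000;  // split into a UTF-16 surrogate pair
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

The result can be passed straight to the path constructor, for example fs::path file(utf8_to_wide("\xC3\xA4\xC3\xB6\xC3\xBC.txt")); writing the bytes as escapes keeps the literal independent of the source file's encoding.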

Opening the file stream

Unfortunately, the standard C++ ofstream API (and thus Boost's, which builds on it) does not allow specifying wchar_t strings as the filename. This is the case for both the constructor and the open method.

You could try to make sure that the path object does not get immediately converted to const char * (by using the C++11 string API) but this probably won't help:

std::ofstream(file.native()) << "Test" << std::endl;

For Windows to work, you might have to call the Unicode-aware Windows API, CreateFileW, convert the HANDLE to a FILE *, then use the FILE * for the ofstream constructor. This is all described in another StackOverflow answer, but I'm not sure if that ofstream constructor will exist on MinGW.

Unfortunately basic_ofstream doesn't seem to allow subclassing for custom basic_filebuf types, so the FILE * conversion might be the only (completely non-portable) option.

An alternative: Memory-mapped files

Instead of using file streams, you can also write to files using memory-mapped I/O. Depending on how Boost implements this (it's not part of the C++ standard library), this method could work with Windows Unicode file names.

Here's a boost example (taken from another answer) that uses a path object to open the file:

#include <boost/filesystem.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <iostream>

int main()
{ 
  boost::filesystem::path p(L"b.cpp");
  boost::iostreams::mapped_file file(p); // or mapped_file_source
  std::cout << file.data() << std::endl;
}
Panay answered 9/5, 2014 at 1:20 Comment(5)
Since fs::exists finds the right file, doesn't that mean, that the string is correctly converted to path::string_type?Conveyancing
I've updated my answer. Looking at the C++ APIs available, the outlook isn't good...Panay
Thank you for your detailed answer, I need to get through it.Conveyancing
Thank you very much, the mapped file does the trick. Is there a drawback using it in comparison to reading and writing files directly with the file streams?Conveyancing
The drawbacks are that you need to resize your file (to the correct size) before opening it, and that any read/write errors after opening the file will manifest as SIGBUS on Unix and EXCEPTION_IN_PAGE_ERROR on Windows—terminating your application if unhandled ;)Panay
B
4

I don't know how the answer here got accepted, since OP does fs::path::imbue(std::locale()); precisely not to give a damn about OS's codepage, std::wstring and what not. Otherwise yeah, he'd just use plain old iconv, Winapi calls or other things suggested in the accepted answer. But that is not the point of using boost::locale here.

The real reason why this doesn't work, even though OP does imbue() the current locale as instructed in Boost's documentation (see "Default Encoding under Microsoft Windows"), is boost (or mingw) bugs that have gone unresolved for at least a couple of years as of March 2015.

Unfortunately, mingw users seem to be left out in the cold.

Now, what boost developers should do in order to cover for these bugs is a whole different matter. It might turn out they need to implement precisely what Dan has stated.

Broadway answered 8/3, 2015 at 17:40 Comment(0)
P
2

Have you considered the approach of using ASCII characters in your source code and using the Boost Messages Formatting capabilities of the Boost.Locale library to look up the desired string using an ASCII key? http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/messages_formatting.html

Alternatively you can use the Boost.Locale library to generate a UTF-8 locale and then imbue the Boost.Filesystem path with that locale using "boost::filesystem::path::imbue()". http://boost.2283326.n4.nabble.com/boost-filesystem-path-as-utf-8-td4320098.html

This may also be of use to you.

Default Encoding under Microsoft Windows http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/default_encoding_under_windows.html

Pauper answered 10/5, 2014 at 4:41 Comment(0)
G
1

EDIT : add references to boost and wchar_t at end of post and another possible solution on Windows

I could reproduce nearly the same thing on Ubuntu and on Windows without even using boost (I don't have it on my Windows box). To fix it, I just had to convert the source to the same encoding as the system, i.e. utf8 on Ubuntu and latin1 or iso-8859-1 on Windows.

As I suspected, the problem comes from the line fs::path file("äöü.txt");. As the encoding of the file is not what is expected, it is more or less read as fs::path file("Ã¤Ã¶Ã¼.txt");. If you check, you will find that the size is 10. That fully explains why the output file has a wrong name.
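To see where a 10-character name can come from, here is a minimal sketch that takes the UTF-8 bytes of äöü.txt, treats every byte as a separate ISO-8859-1 character, and re-encodes the result as UTF-8. The function name latin1_bytes_to_utf8 is made up for this illustration, and ISO-8859-1 is only a stand-in for code page 1252: the two agree for the bytes involved here (0xA0 to 0xFF) but differ in the range 0x80 to 0x9F.

```cpp
#include <cassert>
#include <string>

// Reinterpret every byte of `bytes` as one ISO-8859-1 character and
// re-encode the resulting text as UTF-8. Hypothetical helper that mimics
// what happens when UTF-8 data is mistaken for a single-byte code page.
std::string latin1_bytes_to_utf8(const std::string& bytes) {
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));  // ASCII stays as it is
        } else {
            // Code points U+0080..U+00FF need two bytes in UTF-8.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Feeding it the 10 bytes "\xC3\xA4\xC3\xB6\xC3\xBC.txt" yields the UTF-8 encoding of Ã¤Ã¶Ã¼.txt: each of the three umlauts was two bytes, and each byte has become a character of its own.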

I suspect that the test if (!fs::exists(file)) works correctly because either boost or Windows automatically fixes the encoding on input.

So on Windows, simply use an editor in code page 1252 or latin1 or iso-8859-1, and you should not have problems, provided you do not have to use characters outside of this charset. If you need characters outside of Latin1 I am afraid that you will have to use the unicode API of Windows.

EDIT:

In fact, Windows (> NT) works natively with wchar_t and not char. And not surprisingly, boost on Windows does the same; see the Boost Filesystem library reference. Extract:

For Windows-like implementations, including MinGW, path::value_type is wchar_t. The default imbued locale provides a codecvt facet that invokes Windows MultiByteToWideChar or WideCharToMultiByte API with a codepage of CP_THREAD_ACP if Windows AreFileApisANSI() is true ...

So, another solution on Windows that would allow the full unicode character set (or at least the subset natively offered by Windows) would be to give the file path as a wstring and not as a string. Alternatively, if you really want to use UTF8 encoded filenames, you will have to force the thread locale to use UTF8 and not CP1252. I cannot give a code example of that because I don't have boost on my Windows box, my Windows box runs old XP and does not support UTF8, and I don't want to post untested code, but I think that in that case, you should replace

std::locale::global(boost::locale::generator().generate(""));

with something like :

std::locale::global(boost::locale::generator().generate("UTF8"));

BEWARE : untested so I'm not sure if the string for generate is UTF8 or something else ...

Grecism answered 9/5, 2014 at 23:6 Comment(3)
The size of the internal string is 10, since the string contains the utf8 representation, which is 10 bytes long. Saving the files in different encodings is not possible for me, since then my program won't be cross platform.Conveyancing
What I meant is that if, under Windows, you get a file name encoded in UTF8, you must tell the system it is not Ansi encoded, because Ansi is the default. If you need a unicode constant portable between Windows and Linux, you should use wide characters as suggested by Dan Cecile. By the way, after digging in the boost doc, I think you should try in your app fs::imbue(std::locale()); instead of fs::path::imbue(std::locale()); to tell the whole boost FileSystem module what the current locale is - even if it should already be the default as you take it from boost::locale::generator().generate("").Grecism
Setting the global locale and imbue the path should tell the path object that the string it gets is UTF8 encoded. And the fs::exists just works fine this way. The memory mapped files, which I use together with the path like proposed, also find the right file. So, I think it's possible without wide characters.Conveyancing
