Cross-platform way to handle std::string/std::wstring with std::filesystem::path
I have a sample piece of C++ code that is throwing an exception on Linux:

namespace fs = std::filesystem;
const fs::path pathDir(L"/var/media");
const fs::path pathMedia = pathDir / L"COMPACTO - Diogo Poças.mxf"; // <-- Exception thrown here

The exception being thrown is: filesystem error: Cannot convert character sequence: Invalid or incomplete multibyte or wide character

I surmise that the issue is related to the use of the ç character.

  1. Why is this wide string (wchar_t) an "invalid or incomplete multibyte or wide character"?
  2. Going forward, how do I make related code cross-platform so that it runs on Windows and/or Linux?
    • Are there helper functions I need to use?
    • What rules do I need to enforce from a programmer's PoV?
    • I've seen a response here that says "Don't use wide strings on Linux". Do the same rules apply on Windows?

Linux Environment (not forgetting the fact that I'd like to run cross-platform):

  • Ubuntu 18.04.3
  • gcc 9.2.1
  • C++17
Kapoor answered 23/10, 2019 at 11:29 Comment(0)

Unfortunately std::filesystem was not written with operating system compatibility in mind, at least not as advertised.

For Unix-based systems, we need UTF-8 (u8"string", or just "string", depending on the compiler).

For Windows, we need UTF-16 (L"string").
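One way to see this split from inside a program is to inspect fs::path::value_type, which is the platform's native path character type. The following is only a sketch: the names native_is_wide and native_encoding_name are made up for illustration, and the mapping to "UTF-16" and "UTF-8" reflects common practice on each platform rather than a guarantee from the standard.

```cpp
#include <filesystem>
#include <string>
#include <type_traits>

namespace fs = std::filesystem;

// True where the native path character is wchar_t (Windows),
// false where it is char (Linux and other POSIX systems).
constexpr bool native_is_wide =
    std::is_same_v<fs::path::value_type, wchar_t>;

// Hypothetical helper: names the encoding that native path strings
// conventionally use on the current platform (an assumption, see above).
inline std::string native_encoding_name() {
    return native_is_wide ? "UTF-16" : "UTF-8";
}
```

A string that does not already match the native type has to be converted by std::filesystem::path, and that conversion is the step that fails in the question.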

In C++17 you can use std::filesystem::u8path (which is deprecated in C++20). On Windows, this converts UTF-8 to UTF-16, so you can then pass the UTF-16 result to the Windows APIs.

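#ifdef _WINDOWS_PLATFORM // not predefined by compilers: define it in your build, or test _WIN32 instead
    // Windows I/O setup: switch the console streams to wide (UTF-16) mode.
    // Requires <io.h> and <fcntl.h>.
    _setmode(_fileno(stdin), _O_WTEXT);
    _setmode(_fileno(stdout), _O_WTEXT);
#endif

// u8path interprets its argument as UTF-8 and converts it to the native encoding.
fs::path path = fs::u8path(u8"ελληνικά.txt");

#ifdef _WINDOWS_PLATFORM
    std::wcout << "UTF16: " << path << std::endl;
#else
    std::cout <<  "UTF8:  " << path << std::endl;
#endif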

Alternatively, use your own macro to select UTF-16 on Windows (L"string") and UTF-8 on Unix-based systems (u8"string" or just "string"). Make sure UNICODE is defined for Windows.

#ifdef _WINDOWS_PLATFORM
#define _TEXT(quote) L##quote
#define _tcout std::wcout
#else
#define _TEXT(quote) u8##quote
#define _tcout std::cout
#endif

fs::path path(_TEXT("ελληνικά.txt"));
_tcout << path << std::endl;

See also
https://en.cppreference.com/w/cpp/filesystem/path/native


Note: Visual Studio has a special constructor for std::fstream that accepts a UTF-16 filename, and it is compatible with UTF-8 reads and writes. For example, the following code works in Visual Studio:

fs::path utf16 = fs::u8path(u8"UTF8 filename ελληνικά.txt");
std::ofstream fout(utf16);
fout << u8"UTF8 content ελληνικά";

I am not sure whether that is supported by the latest GCC versions running on Windows.
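If the program keeps its text as UTF-8 in std::string and converts only at the filesystem boundary, the conversion can be wrapped in two small helpers. This is a minimal sketch rather than a definitive implementation: path_from_utf8 and utf8_from_path are hypothetical names, and the __cpp_char8_t check is one assumed way of choosing between u8path (C++17) and the char8_t-based constructor (C++20).

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: build a path from UTF-8 bytes held in a std::string.
inline fs::path path_from_utf8(const std::string& utf8) {
#if defined(__cpp_char8_t)
    // C++20: char8_t input tells the path constructor the bytes are UTF-8.
    return fs::path(std::u8string(utf8.begin(), utf8.end()));
#else
    // C++17: u8path performs the UTF-8 to native conversion.
    return fs::u8path(utf8);
#endif
}

// Hypothetical helper: get the UTF-8 bytes back out of a path.
inline std::string utf8_from_path(const fs::path& p) {
    // u8string() returns std::string in C++17 and std::u8string in C++20;
    // copying through a char pointer handles both.
    const auto u8 = p.u8string();
    return std::string(reinterpret_cast<const char*>(u8.data()), u8.size());
}
```

With helpers like these, the rest of the code never touches wchar_t directly. On Windows the resulting path holds UTF-16 internally, and p.c_str() can be passed to wide API calls.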

Cellini answered 24/10, 2019 at 16:49 Comment(6)
Thanks! It looks like a minefield! Is it best to stick with UTF-8 and then convert to other character encodings/representations as and when required? - Kapoor
It depends on what type of program you are writing. If your program is Linux only, then use UTF-8 only. If your code runs on both Windows and Linux, then stick to UTF-8, and use UTF-16 conversion for Windows API calls. If your program is Windows only, especially a GUI program, then you would use UTF-16 throughout. See also the edit. - Cellini
Thanks for the assistance! You've helped me clear up so many questions I had after retrieving bits and bobs of info from all over the place. - Kapoor
u8path(...) is deprecated because in C++20 we have distinct u8string and char8_t* types, which imply UTF-8 (as opposed to plain string/char* with no encoding specified). std::filesystem::path can accept those as a constructor argument, making u8path redundant. - Pyrrha
This answer is at least partially incorrect. std::filesystem was written with operating system compatibility in mind. The problem OP is having is a GCC bug. Without that bug, OP's code would probably be the best way to write that kind of code in C++17. In C++20, u8"" can be used instead of L"". - Hardman
This whole thing is such a mess. And the C++ standard isn't helping any by making a new type for UTF-8 strings. There are too many places in the standard library that don't support u8strings, not to mention all the code that was happily using UTF-8 in std::string and now has to deal with this headache. - Academy

Looks like a GCC bug.

According to the documentation for std::filesystem::path::path, you should be able to call the std::filesystem::path constructor with a wide-character string, regardless of the underlying platform (that's the whole point of std::filesystem).

Clang shows correct behavior.
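To check how a particular toolchain behaves, the conversion can be probed directly. This is a rough sketch, and wide_path_converts is a hypothetical name: it simply reports whether constructing a path from a wide string throws.

```cpp
#include <exception>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical probe: returns true if this standard library can build a
// path from the given wide string, false if the conversion throws.
inline bool wide_path_converts(const std::wstring& wide) {
    try {
        const fs::path p(wide);
        (void)p; // only the conversion matters here
        return true;
    } catch (const std::exception&) {
        // fs::filesystem_error derives from std::exception.
        return false;
    }
}
```

Going by the reports on this page, calling it with L"COMPACTO - Diogo Poças.mxf" would be expected to return false on the affected GCC build and true where the conversion works as the standard describes.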

Confluence answered 12/1, 2020 at 21:22 Comment(2)
I had a quick search in Bugzilla. Has it been reported/fixed? - Kapoor
I also encountered the problem and filed a bug report here: gcc.gnu.org/bugzilla/show_bug.cgi?id=95048 The example also worked back with gcc 9.1.0. - Pawsner
