Stumped with Unicode, Boost, C++, codecvts

In C++, I want to use Unicode to do things. So after falling down the rabbit hole of Unicode, I've managed to end up in a train wreck of confusion, headaches and locales.

But in Boost I've had the unfortunate problem of trying to use Unicode file paths and trying to use the Boost program options library with Unicode input. I've read whatever I could find on the subjects of locales, codecvts, Unicode encodings and Boost.

My current attempt is a codecvt that takes a UTF-8 string and converts it to the platform's encoding (UTF-8 on POSIX, UTF-16 on Windows). I've been trying to avoid wchar_t.

The closest I've actually gotten is trying to do this with Boost.Locale, to convert from a UTF-8 string to a UTF-32 string on output.

#include <iostream>
#include <string>
#include <boost/locale.hpp>
#include <locale>

int main(void)
{
  std::string data("Testing, 㤹");

  std::locale fromLoc = boost::locale::generator().generate("en_US.UTF-8");
  std::locale toLoc   = boost::locale::generator().generate("en_US.UTF-32");

  typedef std::codecvt<wchar_t, char, mbstate_t> cvtType;
  cvtType const* toCvt = &std::use_facet<cvtType>(toLoc);

  std::locale convLoc = std::locale(fromLoc, toCvt);

  std::cout.imbue(convLoc);
  std::cout << data << std::endl;

  // Output is unconverted -- what?

  return 0;
}

I think I had some other kind of conversion working using wide characters, but I really don't know what I'm even doing. I don't know what the right tool for the job is at this point. Help?

Outdistance answered 22/10, 2011 at 12:49 Comment(0)

Okay, after a long few months I've figured it out, and I'd like to help people in the future.

First of all, the codecvt approach was the wrong way of doing it. Boost.Locale provides a simple way of converting between character sets in its boost::locale::conv namespace. Here's one example (there are others that are not based on locales).

#include <iostream>
#include <string>
#include <boost/locale.hpp>
namespace loc = boost::locale;

int main(void)
{
  loc::generator gen;
  std::locale blah = gen.generate("en_US.utf-32");

  std::string UTF8String = "Tésting!";
  // from_utf will also work with wide strings as it uses the character size
  // to detect the encoding.
  std::string converted = loc::conv::from_utf(UTF8String, blah);

  // Outputs a UTF-32 string.
  std::cout << converted << std::endl;

  return 0;
}

If you replace "en_US.utf-32" with "", the output will be in the encoding of the user's locale instead.

I still don't know how to make std::cout do this all the time, but the translate() function of Boost.Locale outputs in the user's locale.

As for using UTF-8 strings with the filesystem library across platforms, it seems that this is possible; here's a link to how to do it.

Outdistance answered 9/12, 2011 at 6:56 Comment(3)
Here is a link that doesn't go to the index page (for that last link): boost.org/doc/libs/1_51_0/libs/locale/doc/html/… - Levasseur
You say using codecvt is bad, but why does Boost use codecvt as the conversion method in its filesystem library, specifically in the path class? - Zedoary
Hmm... I answered this in '11, so I'm not exactly sure what my mindset was. I suppose the codecvt way that /I/ posted was the wrong way of doing it. Boost.Locale itself uses codecvts to interface with Boost.Filesystem. - Outdistance
  std::cout.imbue(convLoc);
  std::cout << data << std::endl;

This does no conversion, since it uses codecvt<char, char, mbstate_t> which is a no-op. The only standard streams that use codecvt are file-streams. std::cout is not required to perform any conversion at all.

To force Boost.Filesystem to interpret narrow strings as UTF-8 on Windows, use boost::filesystem::path::imbue with a locale that carries a UTF-8 ↔ UTF-16 codecvt facet. Boost.Locale provides an implementation of such a facet.

Sparrow answered 22/10, 2011 at 13:6 Comment(8)
@Jookia: It's unclear to me what exactly you want. You're trying to output a string with an unknown encoding (writing a string literal containing Unicode characters is already non-portable) to cout, which doesn't have a standardized encoding, so you're free to assume whatever encoding you like. I always assume that cout is UTF-8 and let the user configure their console to use UTF-8 or open the files with editors that understand UTF-8. - Sparrow
My code isn't the entire problem; I'm having trouble dealing with Boost, C++, locales and Unicode in general. I want to use UTF-8 strings in my program and translate between the user's locale and UTF-8 for use with cout and cin, which I can't figure out how to do. But then I want to use UTF-8 and Boost, which seems to be impossible as it uses wide strings, which don't help at all. - Outdistance
@Jookia: Again, your question is too vague: "I'm having trouble ... in general"! "I want to use UTF-8 strings in my program": go on! That's what I do. "user's locale from/to UTF-8 for use with cout and cin": why? Just assume it's UTF-8 and let those who use a legacy encoding change their encoding to UTF-8. On Windows you are meant to use wcin and wcout to read/write Unicode data, but that is going to be non-portable, as you'll have to maintain two versions of your code: one that uses wcout on Windows and one that uses cout elsewhere. You don't want this, do you? - Sparrow
"But then I want to use UTF-8 and Boost, which seems to be impossible": in some parts of Boost it's possible but inconvenient. Some parts of Boost don't support Unicode on Windows at all (Boost.Interprocess), some do it wrongly (Boost.Program_Options) and some are painful for cross-platform code (Boost.Filesystem). "as it uses wide strings": no. Some parts of Boost use narrow chars (Boost.Interprocess), some use both (Boost.Filesystem). The problem is that the parts that use narrow strings assume the native encoding instead of UTF-8 by default, imposing the burden on you. - Sparrow
There is a fight in the Boost community about deprecating wide chars and assuming all narrow strings are UTF-8. We (the proponents of UTF-8) are currently losing, since there's not much demand and most Boost developers (e.g. the author of Filesystem) live in the Unix world and don't face the real-world trouble of writing Unicode-correct production code that is portable between Windows and Linux. If you want to change the status quo, open the discussion on the Boost mailing list again. - Sparrow
So I should drop the idea of Unicode and stick with good ol' ASCII, seeing as you can't really use Unicode portably? Or maybe I should just stick with UNIX? - Outdistance
@Jookia: Why do you give up? Yes, you can't always do it portably with existing libraries, and you need to write boilerplate code to do it with others. The choice of supporting it within the boundaries of the possible is up to you. - Sparrow
I give up because I don't fully understand what the problem is, and so I can't actually deduce a possible solution. - Outdistance

The Boost filesystem iostream replacement classes work fine with UTF-16 when used with Visual C++.

However, they do not work (in the sense of supporting arbitrary filenames) when used with g++ in Windows - at least as of Boost version 1.47. There is a code comment explaining that; essentially, the Visual C++ standard library provides non-standard wchar_t based constructors that Boost filesystem classes make use of, but g++ does not support these extensions.

A workaround is to use 8.3 short filenames, but this solution is a bit brittle since with old Windows versions the user can turn off automatic generation of short filenames.


Example code for using Boost filesystem in Windows:
#include "CmdLineArgs.h"        // CmdLineArgs
#include "throwx.h"             // throwX, hopefully
#include "string_conversions.h" // ansiWithFillersFrom( wstring )

#include <boost/filesystem/fstream.hpp>     // boost::filesystem::ifstream
#include <iostream>             // std::cout, std::cerr, std::endl
#include <stdexcept>            // std::runtime_error, std::exception
#include <string>               // std::string
#include <stdlib.h>             // EXIT_SUCCESS, EXIT_FAILURE
using namespace std;
namespace bfs = boost::filesystem;

inline string ansi( wstring const& ws ) { return ansiWithFillersFrom( ws ); }

int main()
{
    try
    {
        CmdLineArgs const   args;
        wstring const       programPath     = args.at( 0 );

        hopefully( args.nArgs() == 2 )
            || throwX( "Usage: " + ansi( programPath ) + " FILENAME" );

        wstring const       filePath        = args.at( 1 );
        bfs::ifstream       stream( filePath );     // Nice Boost ifstream subclass.
        hopefully( !stream.fail() )
            || throwX( "Failed to open file '" + ansi( filePath ) + "'" );

        string line;
        while( getline( stream, line ) )
        {
            cout << line << endl;
        }
        hopefully( stream.eof() )
            || throwX( "Failed to list contents of file '" + ansi( filePath ) + "'" );

        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        cerr << "!" << x.what() << endl;
    }
    return EXIT_FAILURE;
}
Archaeological answered 22/10, 2011 at 13:22 Comment(11)
I'm trying to do it cross-platform. - Outdistance
@Jookia: ok. i was assuming you restricted yourself to UTF-8 locale *nix (and Mac), and Windows. supporting general cross-platform is I think not something one man can do. good luck! - Archaeological
@Jookia: This answer is proof of some of my claims below. To use Boost.Filesystem with Unicode on Windows you must use wstring; on non-Windows platforms you definitely want to use string. This shows how Boost.Filesystem does not hide the platform differences and does not make writing cross-platform code simpler. I must admit that in the case of Boost.Filesystem you can change the way it interprets narrow strings to UTF-8, thus making it easier to port the code. However, the point is that Boost could make our lives much easier by just changing two lines in Boost.Filesystem. And it's a pity they don't want to. - Sparrow
@ybungalobill: note that boost filesystem does not support general filenames with g++ in windows, and that that problem can't be fixed by using utf-8 encoding everywhere. - Archaeological
@AlfP.Steinbach Excuse me, what do you mean exactly? If it's compiled against BOOST_POSIX_API then indeed it does not. If it's compiled against BOOST_WINDOWS_API then the only part that doesn't is boost::filesystem::i/ofstream. They could implement the latter by implementing the filebuf directly on top of the Windows API (I did this). - Sparrow
@ybungalobill: i mean exactly what i wrote, which was pretty exact. e.g., currently boost::filesystem::ifstream does not manage to open a file with a name like [π.recipe], when it's used with g++. well unless one sets the ANSI codepage to some encoding that supports π, then forsaking some other characters. using utf-8 doesn't fix that. the upshot is that neither the standard library nor boost supports general filenames in general in Windows, except for with Visual C++. - Archaeological
@AlfP.Steinbach "boost filesystem does not support general filenames" can mean, e.g., "boost::filesystem::remove() doesn't support general filenames", among all the other possible interpretations. Now it's clear. Anyway, the point of UTF-8 is not to magically support Unicode but rather to provide a uniform portable interface across systems that have such support. - Sparrow
@ybungalobill: hm. i think a "uniform portable interface" is a good idea, but utf8 as common encoding clashes with the principle of not paying for what you don't need or use. the native encoding used should instead be abstracted away by the uniform portable interface. anyway you first need something that works on all relevant platforms, and standard filestreams don't with g++ in Windows. however, a set of wrappers like this goes a long way toward supporting std streams in Windows. it's only when short names are turned off in Windows that it fails. hth. - Archaeological
@AlfP.Steinbach: What are you paying if you assume UTF-8? If you're talking about the conversion to/from UTF-8, then I say that any portable solution will involve some kind of conversion. This already starts when C gives preference to narrow chars over wide chars ("" is shorter than L""). It continues when you have existing standardized interfaces whose API/ABI you can't change but whose semantics you can (e.g. the only way to make std::exception::what Unicode-aware on all platforms is to standardize it to be UTF-8, or whatever UTF fits within the char on that platform). - Sparrow
So if you really want to analyze the cost of assuming UTF-8, you must analyze the usage patterns and figure out where you need to do the conversions. Moreover, there's no such thing as a native encoding on Windows. You have two encodings: UTF-16 and some deprecated, non-Unicode-aware 'ANSI' encoding. Furthermore, you always talk about the boost::filesystem fstreams, which are a rather minor part of this library. The rest of the functionality will work unchanged. - Sparrow
"...you first need something that works on all relevant platforms...": I disagree. If it doesn't work on one platform because it's impossible there, is that an excuse for scrapping it even if it makes things easier on all other platforms? - Sparrow
