Raw string literals and file encoding

Asked 30/1, 2014 at 15:33 Answered 30/1, 2014 at 15:59

C++11 introduced raw string literals, which can be pretty useful for representing quoted strings and literals with lots of special symbols, like Windows file paths, regex expressions, etc.

std::string path = R"(C:\teamwork\new_project\project1)"; // no tab nor newline!
std::string quoted = R"("quoted string")";
std::string expression = R"([\w]+[ ]+)";

These raw string literals can also be combined with encoding prefixes (u8, u, U, or L), but when no encoding prefix is specified, does the file encoding matter? Let's suppose that I have this code:

auto message = R"(Pick up a card)";         // raw string 1
auto cards = R"(🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪🂫🂬🂭🂮)"; // raw string 2

If I can write and store the code above, it's obvious that my source code is encoded as Unicode, so I'm wondering:

Would raw string 1 be a Unicode literal (even though it only uses ASCII characters)? In other words, does the raw string inherit the encoding of the file where it is written, or does the compiler auto-detect that Unicode isn't needed regardless of the file encoding?
Would the encoding prefix U be necessary on raw string 2 in order to treat it as a Unicode literal, or would it be Unicode automatically due to its contents and/or the source file encoding?

Thanks for your attention.

EDIT:

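Testing the code above on ideone.com and printing the demangled types of the message and cards variables outputs char const*:

#include <cstdlib>   // std::free
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang specific)
#include <iostream>
#include <string>
#include <typeinfo>

// Returns the readable name of T, or the mangled name if demangling fails.
template<typename T> std::string demangle(T)
{
    int status = 0;
    char *const name = abi::__cxa_demangle(typeid(T).name(), 0, 0, &status);
    std::string result(status == 0 && name ? name : typeid(T).name());
    std::free(name);
    return result;
}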

int main()
{
    auto message = R"(Pick up a card)";
    auto cards = R"(🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪🂫🂬🂭🂮)";

    std::cout
        << "message type: " << demangle(message) << '\n'
        << "cards type: " << demangle(cards) << '\n';

    return 0;
}

Output:

message type: char const*
cards type: char const*

which is even weirder than I thought; I was convinced that the type would be wchar_t (even without the L prefix).

Lucan answered 30/1, 2014 at 15:33 Comment(2)

This part of the standard is quite murky. In GCC and MSVC I believe the string will just be the bytes between the quotation marks. – Hwang 30/1, 2014 at 15:34

@Hwang "the string will be the bytes between the quotation marks" so... this would imply the source file encoding :O – Lucan 7/2, 2014 at 8:7

Yes, it matters, even just to compile your source. If the file is saved as UTF-16 and you are using gcc, you will need something like -finput-charset=UTF-16 to compile it (the same thing should apply to VS).

But IMHO there is something more fundamental to take into account in your code. For example, std::string is a container of char, which is 1 byte in size. If you are dealing with UTF-16, for instance, each code unit needs 2 bytes, so (short of a by-hand conversion) you will need at least a wchar_t (std::wstring) or, to be safer in C++11, a char16_t (std::u16string).

So, to use Unicode you will need a container for it and a compilation environment prepared to handle your Unicode-encoded sources.

Precarious answered 30/1, 2014 at 15:59 Comment(2)

Raw literal 1 and raw literal 2 aren't stored in any container; they're stored in variables of deduced type. I did it that way in the example because I was not sure which kind of std::basic_string would be the best choice. – Lucan 30/1, 2014 at 16:21

@Lucan I think your observation is very important. I am gonna test it! – Precarious 30/1, 2014 at 16:24

Raw string literals change how escapes are dealt with but do not change how encodings are handled. Raw string literals still convert their contents from the source encoding to produce a string in the appropriate execution encoding.

The type of a string literal and the appropriate execution encoding are determined entirely by the prefix. R alone always produces a char string in the narrow execution encoding. If the source is UTF-16 (and the compiler supports UTF-16 as the source encoding) then the compiler will convert the string literal contents from UTF-16 to the narrow execution encoding.

Reni answered 30/1, 2014 at 15:59 Comment(0)
