Why mask a char with 0xFF when converting narrow string to wide string?
Consider this function to convert narrow strings to wide strings:

#include <codecvt>
#include <locale>
#include <string>

std::wstring convert(const std::string& input)
{
    try
    {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
        return converter.from_bytes(input);
    }
    catch(const std::range_error&)
    {
        std::size_t length = input.length();
        std::wstring result;
        result.reserve(length);
        for(std::size_t i = 0; i < length; i++)
        {
            result.push_back(input[i] & 0xFF);
        }
        return result;
    }
}

I am having difficulty understanding the need for this expression in the fallback path:

result.push_back(input[i] & 0xFF);

Why is each character in the string being masked with 0xFF (0b11111111)?

Griceldagrid answered 5/4, 2018 at 9:38 Comment(10)
Why are you asking other people here, instead of the person who wrote that code sample? We have comments for that. I don't see how using a separate question instead makes sense or is useful.Scholem
That looks like it converts any value higher than the highest ASCII value to an acceptable ASCII value.Archaism
@underscore_d That was written in 2016, so I was not sure if it would get answered.Griceldagrid
And yet its author logged on just 2 days ago, as shown in their profile, so I don't see why you'd be unable to get a reply from them.Scholem
The point is I would like to know why a character would be ANDed with a byte.Griceldagrid
0xFF is an integer literal, not a "byte" (which isn't even a thing in C++).Sportsman
@TobySpeight C++17 added std::byte...Scholem
Thanks @underscore_d, I forgot that.Sportsman
The only thing I can think of is systems where wchar_t is signed. Explicitly widening the char to an int in that way prevents sign extension, which would corrupt the value.Alamo
MistyD, underscore_d: First, posing a follow-up as a separate question seems totally right to me. I am the author of that code snippet. To answer the question: I don't know any more, because I have been out of C++ for ages now, but it seems to me that the answer below from Toby Speight is exactly correct. Short answer: it didn't work without the mask, I must have found this somewhere, and it solved the problem. By the way, I was using Visual Studio with its compiler (default settings, I think), so maybe the problem I was fixing only happens on Windows.Incompetence

Masking with 0xFF reduces any negative values into the range 0-255.

This is reasonable if, for example, your platform's char is an 8-bit signed type representing ISO-8859-1 characters, and your wchar_t is representing UCS-2, UTF-16 or UCS-4.


Without this correction (or something similar, such as casting to unsigned char or std::byte), you would find that characters are sign-extended when promoted to the wider type.

Example: 0xa9 (© in Unicode and Latin-1, -87 in signed 8-bit) would become \uffa9 instead of \u00a9.


I think it's clearer to convert the char to an unsigned char - that works for any size char, and conveys the intent better. You can change that expression directly, or create a codecvt subclass that gives a name to what you're doing.

Here's how to write and use a minimal codecvt (for narrow → wide conversion only):

#include <codecvt>
#include <locale>
#include <string>

class codecvt_latin1 : public std::codecvt<wchar_t,char,std::mbstate_t>
{
protected:
    virtual result do_in(std::mbstate_t&,
                         const char* from,
                         const char* from_end,
                         const char*& from_next,
                         wchar_t* to,
                         wchar_t* to_end,
                         wchar_t*& to_next) const override
    {
        while (from != from_end && to != to_end)
            *to++ = (unsigned char)*from++;
        from_next = from;
        to_next = to;
        return result::ok;
    }
};

std::wstring convert(const std::string& input)
{
    using codecvt_utf8 = std::codecvt_utf8<wchar_t>;
    try {
        return std::wstring_convert<codecvt_utf8>().from_bytes(input);
    } catch (std::range_error&) {
        return std::wstring_convert<codecvt_latin1>{}.from_bytes(input);
    }
}
#include <iostream>

int main()
{
    std::locale::global(std::locale{""});

    // UTF-8:  £© おはよう
    std::wcout << convert(u8"\xc2\xa3\xc2\xa9 おはよう") << std::endl;
    // Latin-1: Â£©
    std::wcout << convert("\xc2\xa3\xa9") << std::endl;
}

Output:

£© おはよう
Â£©
Sportsman answered 5/4, 2018 at 10:8 Comment(3)
I might be missing something here but 0xa9 is 169 decimal (unsigned) . How is this signed again ?Griceldagrid
@Griceldagrid On some systems both char and wchar_t are signed.Alamo
If char is a signed type and if CHAR_BIT is 8, then the range of char is -128 to +127 (i.e. the same as a std::int8_t). If you store 169 into a std::int8_t, that's greater than +127, so it will be truncated to -87.Sportsman

It looks like on conversion failure the code tries its own conversion by just copying the string into a wstring char for char.

The & 0xFF is meant to "clean" any values higher than 255 to fit in the (extended) ASCII table. This is a no-op, however, because input[i] returns a char and sizeof(char) == 1, which means that 255 is the maximum value anyway (in the case of CHAR_BIT == 8 and char being unsigned).

The equivalent would just be to copy them over right away using the constructor:

std::wstring result(input.begin(), input.end());
Archaism answered 5/4, 2018 at 9:52 Comment(13)
So are you saying that using std::wstring result(input.begin(), input.end()); in the catch block would have a similar effect?Griceldagrid
@Griceldagrid Yes. it would.Archaism
Actually I get different results if I pass in '\xc2'Griceldagrid
Isn't char always 8 bits?Griceldagrid
If it was ASCII, then the constant would need to be 0x7F - ASCII is a 7-bit code.Sportsman
@Griceldagrid - the standard imposes that a char is 8 bits or more.Dysphasia
@TobySpeight There we go. I know the C++-goers here like the details :P. Thanks for the help and the spelling corrections.Archaism
@TobySpeight in that case why would I get different results if I pass in '\xc2' which is unicode for copyright symbol.Griceldagrid
@Misty: Probably because your platform's char is a signed type - when promoted to std::wchar_t (in the push_back()), the value will be sign-extended.Sportsman
Can you explain that a little bit more in the answer ? I understand wstring uses wchar_t so not sure how that makes a differenceGriceldagrid
@Misty, Unicode \xc2 is Â, not a copyright symbol. (It does happen to be the first octet of a UTF-8 encoding of ©, but that's not at all the same).Sportsman
Sorry I meant \xa9Griceldagrid
This won't necessarily be a no-op on systems with signed wchar_t types which could have their negative bit set by higher values (sign extension).Alamo
