Do the strict aliasing rules in C++20 allow `reinterpret_cast` between the standard C++ Unicode character types and their underlying types?
S

3

3

Do C++20's strict aliasing rules ([basic.lval]/11) allow the following?

  1. cast between char* and char8_t*
string str = "string";
u8string u8str { (char8_t*) &*str.data() }; // c++20 u8string

u8string u8str2 = u8"zß水🍌";
string str2 { (char*) u8str2.data() };
  2. cast between uint32_t*, uint_least32_t* and char32_t*
vector<uint32_t> ui32vec = { 0x007a, 0x00df, 0x6c34, 0x0001f34c };
u32string u32str { (char32_t*) &*ui32vec.data(), ui32vec.size() };

u32string u32str2 = U"zß水🍌";
vector<uint32_t> ui32vec2 { (uint32_t*) &*u32str2.begin(),
                            (uint32_t*) &*u32str2.end() };
  3. cast between uint16_t*, uint_least16_t* and char16_t*
vector<uint16_t> ui16vec = { 0x007a, 0x00df, 0x6c34, 0xd83c, 0xdf4c };
u16string u16str { (char16_t*) &*ui16vec.data(), ui16vec.size() };

// \U0001F34C is stored as the surrogate pair 0xd83c 0xdf4c; writing the
// surrogates as \ud83c\udf4c is ill-formed in a C++ literal
u16string u16str2 = u"zß水\U0001F34C";
vector<uint16_t> ui16vec2 { (uint16_t*) &*u16str2.begin(),
                            (uint16_t*) &*u16str2.end() };

Update

basic_string constructor overload (6)

template< class InputIt >    
basic_string( InputIt first, InputIt last,
              const Allocator& alloc = Allocator() );

vector constructor overload (4)

template< class InputIt >    
vector( InputIt first, InputIt last,
        const Allocator& alloc = Allocator() );

I wonder whether it is OK to use the LegacyInputIterator constructors instead:

  1. char* and char8_t* as LegacyInputIterator
string str = "string";
u8string u8str {   str.begin(),   str.end()  };
// or, with raw pointers:
u8string u8str { &*str.begin(), &*str.end()  };

u8string u8str2 = u8"zß水🍌";
string str2 {   u8str2.begin(),   u8str2.end() };
// or, with raw pointers:
string str2 { &*u8str2.begin(), &*u8str2.end() };
  2. uint32_t*, uint_least32_t* and char32_t* as LegacyInputIterator
vector<uint32_t> ui32vec = { 0x007a, 0x00df, 0x6c34, 0x0001f34c };
u32string u32str {   ui32vec.begin(),   ui32vec.end() };
// or, with raw pointers:
u32string u32str { &*ui32vec.begin(), &*ui32vec.end() };

u32string u32str2 = U"zß水🍌";
vector<uint32_t> ui32vec2 { u32str2.begin(),
                            u32str2.end() };
// or, with raw pointers:
vector<uint32_t> ui32vec2 { &*u32str2.begin(),
                            &*u32str2.end() };
  3. uint16_t*, uint_least16_t* and char16_t* as LegacyInputIterator
vector<uint16_t> ui16vec = { 0x007a, 0x00df, 0x6c34, 0xd83c, 0xdf4c };
u16string u16str {   ui16vec.begin(),   ui16vec.end() };
// or, with raw pointers:
u16string u16str { &*ui16vec.begin(), &*ui16vec.end() };

// \U0001F34C is stored as the surrogate pair 0xd83c 0xdf4c
u16string u16str2 = u"zß水\U0001F34C";
vector<uint16_t> ui16vec2 { u16str2.begin(),
                            u16str2.end() };
// or, with raw pointers:
vector<uint16_t> ui16vec2 { &*u16str2.begin(),
                            &*u16str2.end() };
Siberia answered 2/6, 2019 at 12:57 Comment(3)
@curiousguy: They are not typedefs. They are distinct types (which do have underlying types). – Coracle
@curiousguy: It depends on what "these" you're talking about. Since the question was pretty focused on the char*_t, I assumed that's what you were talking about. Those are not typedefs, and cplusplus.com doesn't say that they are. – Coracle
@NicolBolas I'm sorry for the confusion. I meant the identifiers in the question that are typedefs (uint32_t, uint_least32_t, uint16_t, uint_least16_t); it depends on which type they represent. – Kershaw
C
9

The char*_t family of types does not have any special aliasing rules, so the standard rules apply. And those rules make no exception for conversions between a type and its underlying type.

So most of what you did is UB. The one case that isn't UB is char, because of its special status: you can in fact read the bytes of a char8_t object as an array of char. But you can't do the opposite and read the bytes of a char array as char8_t.

Now, the values of these types are freely convertible to each other, so you can convert the values in those arrays to the other type any time you want.
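One way to do that value conversion, sketched here with the iterator-range constructors of basic_string and vector, converts each element implicitly and involves no pointer cast. The helper names to_u32string and to_code_units are hypothetical, and the sketch assumes the platform provides std::uint32_t.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helpers: each element is converted by value
// (uint32_t -> char32_t and back), so no aliasing question arises.
std::u32string to_u32string(const std::vector<std::uint32_t>& in) {
    return std::u32string(in.begin(), in.end());
}

std::vector<std::uint32_t> to_code_units(const std::u32string& in) {
    return std::vector<std::uint32_t>(in.begin(), in.end());
}

// Example use: a round trip preserves the code points.
inline void conversion_example() {
    const std::vector<std::uint32_t> v{0x7a, 0xdf, 0x6c34, 0x1f34c};
    assert(to_code_units(to_u32string(v)) == v);
}
```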

All that being said, on real implementations those casts will almost certainly work. Well, until they don't: you modify an object through a type it isn't supposed to be modified through, and the compiler doesn't reload the changed value because it assumed the value couldn't have changed. So really, just use the correct, meaningful type.
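To make the "doesn't reload" remark concrete, here is a small illustrative sketch. The function store_both is hypothetical, and whether a given compiler actually performs the optimization depends on the compiler and its settings; the point is only that it is allowed to.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical function: char32_t and std::uint_least32_t are distinct
// types, so an optimizer may assume `a` and `b` never name the same object.
char32_t store_both(char32_t& a, std::uint_least32_t& b) {
    a = U'x';
    b = 42;     // assumed not to modify `a`
    return a;   // may be folded to U'x' without reloading `a`
}

// Well-defined use: the two references name different objects.
inline void store_both_example() {
    char32_t c = 0;
    std::uint_least32_t u = 0;
    assert(store_both(c, u) == U'x');
    assert(u == 42);
}
```

Passing the same object for both parameters via a pointer cast is the UB case: the function could then return U'x' even though the object was last written as 42.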

Coracle answered 2/6, 2019 at 13:43 Comment(0)
S
2

Just so we are on the same page: the C-style cast (T*) expression is equivalent here to reinterpret_cast<T*>(expression) ([expr.cast]/4.4), which in turn is equivalent to static_cast<T*>(static_cast<void*>(expression)) ([expr.reinterpret.cast]/7). This does nothing to the value of the pointer, as the objects are not pointer-interconvertible (see [expr.static.cast]/13 and [basic.compound]/4).

So yes, we have to look at [basic.lval]/11 to see whether the object can be aliased. The glvalue used for the access must have a type that is similar to one of:

  • the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object, or
  • a char, unsigned char, or std::byte type.

That is not the case here. Even though char8_t has unsigned char as its underlying type, it is not similar to unsigned char.

So, for example:

unsigned char uc = 'a';

// Represents address of uc
unsigned char* uc_ptr = &uc;

// Still holds the address of uc, not a char8_t
char8_t* c8_ptr = reinterpret_cast<char8_t*>(uc_ptr);

char8_t c8 = *c8_ptr;  // UB, as `char8_t` is not `cv unsigned char`.

Though because of [basic.fundamental]/6, which says:

A fundamental type specified to have a signed or unsigned integer type as its underlying type has the same object representation [...]

You can do reinterpret_cast<unsigned char*>(pointer-to-char8_t) and have all the values be equal, but that is the only such case (char* also works if char is unsigned or none of the values has its sign bit set). For all other types, you can rely on this rule to memcpy:

// Assuming std::is_same_v<uint32_t, uint_least32_t>
vector<uint32_t> ui32vec = { 0x007a, 0x00df, 0x6c34, 0x0001f34c };
u32string u32str(ui32vec.size(), U'\x00');
std::memcpy(u32str.data(), ui32vec.data(), ui32vec.size() * sizeof(uint32_t));

u32string u32str2 = U"zß水🍌";
vector<uint32_t> ui32vec2(u32str2.size(), 0);
// Destination first: copy from the string into the vector
std::memcpy(ui32vec2.data(), u32str2.data(), u32str2.size() * sizeof(uint32_t));
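
A reusable form of the memcpy approach might look like the sketch below. The helper names copy_to_u32string and copy_to_vector are made up for illustration, and the static_assert spells out the assumption that std::uint32_t exists and has the same size as char32_t.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Assumption of this sketch: both element types have the same size.
static_assert(sizeof(char32_t) == sizeof(std::uint32_t),
              "sketch assumes uint32_t and char32_t have the same size");

// Hypothetical helper: copies object representations, never aliases.
std::u32string copy_to_u32string(const std::vector<std::uint32_t>& in) {
    std::u32string out(in.size(), U'\0');
    if (!in.empty())
        std::memcpy(out.data(), in.data(), in.size() * sizeof(char32_t));
    return out;
}

// Hypothetical helper for the opposite direction.
std::vector<std::uint32_t> copy_to_vector(const std::u32string& in) {
    std::vector<std::uint32_t> out(in.size());
    if (!in.empty())
        std::memcpy(out.data(), in.data(), in.size() * sizeof(char32_t));
    return out;
}

// Example use: a round trip preserves the code points.
inline void memcpy_example() {
    const std::vector<std::uint32_t> v{0x7a, 0xdf, 0x6c34, 0x1f34c};
    assert(copy_to_vector(copy_to_u32string(v)) == v);
}
```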
Schonfeld answered 2/6, 2019 at 13:58 Comment(9)
So the iterator ctor or memcpy are the only options, because the strict aliasing rules are so strict that they do not allow even the underlying types (unsigned char, uint_least32_t, uint_least16_t)? – Siberia
@Siberia Yes. If you are specifically using vector and string, std::memcpy is by far the best option, as you would have to copy anyway (and g++ ends up doing a memcpy if you reinterpret_cast, so it is the same thing). If you are using a non-owning container, like std::string_view, you cannot reinterpret. – Schonfeld
Dang it. The iterator ctor of string is not allowed either, because in LegacyInputIterator the value_type is not interchangeable between any CharT and its underlying type anyway. – Siberia
@Siberia The iterator ctor should work, as the iterator's value type only needs to be implicitly convertible, and char32_t and uint_least32_t are implicitly convertible. – Schonfeld
How about char8_t and its underlying unsigned char, are they also implicitly convertible? – Siberia
@Siberia Yes. char*_t, (un)signed char, uint*_t and uint_*_t are all integral types, which are all implicitly convertible to each other. – Schonfeld
@sandthorn: Convertible is a different question from whether casting from one to the other will work. – Coracle
@NicolBolas By using the LegacyInputIterator ctor in vector and string, can I just write someStringOrVector { src.begin(), src.end() }, because it only requires is_convertible_v so that value_type = *Iter is valid? – Siberia
@Siberia Yes, but it is slower, as it does everything through a generic iterator instead of a pointer. See the second row of the sequence container requirements table. – Schonfeld
J
1

C-style cast is not the same thing as reinterpret_cast.

The standard sections I think are relevant to your question:

6.7.1.9: Type char8_t denotes a distinct type whose underlying type is unsigned char. Types char16_t and char32_t denote distinct types whose underlying types are uint_least16_t and uint_least32_t, respectively, in <cstdint>.

7.2.1.11: If a program attempts to access the stored value of an object through a glvalue whose type is not similar ([conv.qual]) to one of the following types the behavior is undefined:

1. the dynamic type of the object,

2. a type that is the signed or unsigned type corresponding to the dynamic type of the object, or

3. a char, unsigned char, or std::byte type.

  1. char8_t*-->char* Yes.
    Because char is one of the types through which any object may be accessed. But the standard does not guarantee that the (dereferenced) converted values are equal for distinct types: char can be signed or not, and char8_t is unsigned. char8_t*-->unsigned char* is valid, but equality should not be guaranteed there either, because the types are still distinct. But given that it's char8_t's underlying type it should be, I guess?
  2. char*-->char8_t* No.
    As per 6.7.1.9 those types are distinct. An argument might be made that the "whose underlying type is unsigned char" part could apply, with unsigned char being explicitly allowed in 7.2.1.11.3, but I don't think that would be the correct interpretation; being distinct should be the deciding factor. That is supported by the following comment in the proposal P0482R6 - char8_t: A type for UTF-8 characters and strings (Revision 6 - 2018-11-09) (I did not find a more recent revision):

    Finally, processing of UTF-8 strings is currently subject to an optimization pessimization due to glvalue expressions of type char potentially aliasing objects of other types. Use of a distinct type that does not share this aliasing behavior may allow for further compiler optimizations.

  3. uint32_t*<-->char32_t*, uint16_t*<-->char16_t*, uint16_t*<-->uint_least16_t*, uint32_t*<-->uint_least32_t*, uint_least32_t*<-->char32_t*, uint_least16_t*<-->char16_t*: No.
    The types in each pair are distinct (unless the uintN_t and uint_leastN_t typedefs happen to name the same type, in which case that cast is trivially fine), so 7.2.1.11.1 does not apply, and neither type appears in 7.2.1.11.3, so not even the second part of the argument in 2. can be relevant.

  4. unsigned char*-->char8_t* No.
    By the same argument as in 2. It's not a T*->T* cast, which would obviously be allowed.

  5. char8_t*-->unsigned char* Yes.
    Because unsigned char is also one of the allowed types per 7.2.1.11.3. But I would still argue that the standard does not guarantee that the (dereferenced) converted values will be equal. But given that it's char8_t's underlying type, it doesn't have any option other than to be equal, I guess?

Jugoslavia answered 2/6, 2019 at 13:50 Comment(3)
unsigned char <-->char8_t, uint_least16_t<-->char16_t, uint_least32_t<-->char32_t: Would you mind adding these cases to the answer above? – Siberia
@Siberia Sure, give me a moment. – Jugoslavia
@Siberia There: all of them are Nos except char8_t -> unsigned char. I added them to 3. – Jugoslavia
