Premise
- I have a blob of binary data in memory, represented as a char* (maybe read from a file, or transmitted over the network).
- I know that it contains a UTF8-encoded text field of a certain length at a certain offset.
Question
How can I (safely and portably) get a u8string_view to represent the contents of this text field?
Motivation
The motivation for passing the field to down-stream code as a u8string_view is:
- It very clearly communicates that the text field is UTF8-encoded, unlike string_view.
- It avoids the cost (likely free-store allocation + copying) of returning it as u8string.
What I tried
The naive way to do this would be:
char* data = ...;
size_t field_offset = ...;
size_t field_length = ...;
char8_t* field_ptr = reinterpret_cast<char8_t*>(data + field_offset);
u8string_view field(field_ptr, field_length);
However, if I understand the C++ strict-aliasing rules correctly, this is undefined behavior because it accesses the contents of the char* buffer via the char8_t* pointer returned by reinterpret_cast, and char8_t is not an aliasing type.
Is that true?
Is there a way to do this safely?
char is special here. Is gcc/clang... issuing a warning? – Odisodium

char is special but I don't think it applies here. A char* can alias anything but a char8_t* cannot alias a char as far as I know. – Gadmann

[...] std::start_lifetime_as, but I'm not sure if there's anything to help that case in C++20 besides acknowledging that you're UBing to achieve that. – Gadmann

char can be used to hold other objects via placement-new, for example. Considering that both types are trivial, I wonder if it would make the behavior defined to iterate the buffer and assign itself to the target type? E.g. for (auto i = data + field_offset; i < data + field_offset + field_length; ++i) { *reinterpret_cast<char8_t *>(i) = *i; } If this is defined behavior then that could be a workaround to avoid the UB. Assuming the bit representation of each value is identical, a smart compiler could elide the whole loop. – Mayfield

[...] new (i) char8_t{*i}; ? If defined, it would be interesting to see what compilers do with both loops. – Mayfield

[...] reinterpret_casts data received from the network or read from files. It's very common practice and the standard is defective for not acknowledging it. – Magnusson