How to reinterpret a sequence of bytes as a POD structure without causing UB?

Suppose we get some data as a sequence of bytes, and want to reinterpret that sequence as a structure (having some guarantees that the data is indeed in the correct format). For example:

#include <fstream>
#include <vector>
#include <cstdint>
#include <cstdlib>
#include <iostream>

struct Data
{
    std::int32_t someDword[629835];
    std::uint16_t someWord[9845];
    std::int8_t someSignedByte;
};

Data* magic_reinterpret(void* raw)
{
    return reinterpret_cast<Data*>(raw); // BAD! Breaks strict aliasing rules!
}

std::vector<char> getDataBytes()
{
    std::ifstream file("file.bin",std::ios_base::binary);
    if(!file) std::abort();
    std::vector<char> rawData(sizeof(Data));
    file.read(rawData.data(),sizeof(Data));
    if(!file) std::abort();
    return rawData;
}

int main()
{
    auto rawData=getDataBytes();
    Data* data=magic_reinterpret(rawData.data());
    std::cout << "someWord[346]=" << data->someWord[346] << "\n";
    data->someDword[390875]=23235;
    std::cout << "someDword[390875]=" << data->someDword[390875] << "\n"; // print the element just written, not the array's address
}

Now the magic_reinterpret here is actually bad, since it breaks strict aliasing rules and thus causes UB.

How should it instead be implemented so that it neither causes UB nor copies the data (as memcpy would)?


EDIT: the getDataBytes() function above should be treated as a function that cannot be changed. A real-world example is ptrace(2), which on Linux, when request==PTRACE_GETREGSET and addr==NT_PRSTATUS, writes (on x86-64) one of two possible structures of different sizes, depending on tracee bitness, and returns the size. Here the code calling ptrace can't predict which type of structure it will get until it actually makes the call. How could it then safely reinterpret the result as the correct pointer type?

Shivaree answered 11/12, 2015 at 12:20 Comment(6)
Essentially, instead of casting a byte array with the data to the struct, take a struct instance, cast its address to a byte pointer, and fill those bytes with the data. [J. Pileborg posted it as code below] - Engrossment
But, while you did mention it yourself, just a reminder to always think of int sizes, negative number formats, struct alignment, padding, ... - Engrossment
In particular, Data is going to be 64 bits on most machines, but there are only 56 bits of information there. - Crossexamine
After reading the edit, it seems you're in an impossible situation. If you try to do it in C++ you will technically have undefined behavior (though it will work fine with a reinterpret_cast). You could do it in C and use a union to do type-punning, maybe write a function in C to do only this? - Aeriela
Also, is there nothing in the file which will tell you what kind of structure the next one will be? A single byte or integer? Then you could modify the structure to include that single identifying value as the first member, peek into the file to see what kind of structure the next piece is, and use a read call like in my answer for that specific structure. - Aeriela
@JoachimPileborg the question deals with the cases when there are no such hints. In the case of the example in the edit I don't know of any way to query the size of the regset and thus can't infer the type of structure to expect before calling ptrace. The C union solution looks workable, although it's really a hack. - Shivaree
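For the case from the edit, where the structure type is only known from the size that comes back, one well-defined option is to pick the target type by that size and copy the bytes into it. This is only a sketch and it does copy, which the question wanted to avoid. Regs32, Regs64, DecodedRegs and decodeRegs are hypothetical names made up for illustration, not the real register structures from the system headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for the two layouts the kernel might write.
struct Regs32 { std::uint32_t r[4]; };
struct Regs64 { std::uint64_t r[4]; };

enum class RegsKind { Unknown, Bits32, Bits64 };

// Holds whichever layout was recognised; only the member named by `kind` is meaningful.
struct DecodedRegs
{
    RegsKind kind;
    Regs32 r32;
    Regs64 r64;
};

// `buf` must hold at least `written` bytes, where `written` is the size
// reported by the call that filled the buffer.
DecodedRegs decodeRegs(const unsigned char* buf, std::size_t written)
{
    assert(buf != nullptr);
    DecodedRegs out{};  // value-initialised: kind == Unknown, registers zeroed
    if (written == sizeof(Regs64))
    {
        std::memcpy(&out.r64, buf, sizeof out.r64);
        out.kind = RegsKind::Bits64;
    }
    else if (written == sizeof(Regs32))
    {
        std::memcpy(&out.r32, buf, sizeof out.r32);
        out.kind = RegsKind::Bits32;
    }
    return out;
}
```

The caller then switches on `kind` and never touches the raw buffer through a struct pointer.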
A
4

By not reading the file as a stream of bytes, but as a stream of Data structures.

Simply do e.g.

Data data;
file.read(reinterpret_cast<char*>(&data), sizeof(data));
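A slightly fuller sketch of the same idea, with a check that the whole object was actually read. Record, readRecord and roundTrip are hypothetical names; Record is a small stand-in for the question's Data, and the sketch assumes the bytes were produced with the same layout (endianness, padding) the compiler uses.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <type_traits>

// Hypothetical small record standing in for the question's Data.
struct Record
{
    std::int32_t someDword[4];
    std::uint16_t someWord[2];
    std::int8_t someSignedByte;
};

// Filling an object from raw bytes is only well defined for trivially copyable types.
static_assert(std::is_trivially_copyable<Record>::value,
              "Record must be trivially copyable");

// Fills `out` from `in`; returns false if fewer than sizeof(Record) bytes were read.
bool readRecord(std::istream& in, Record& out)
{
    in.read(reinterpret_cast<char*>(&out), sizeof out);
    return in.gcount() == static_cast<std::streamsize>(sizeof out);
}

// In-memory demonstration: write a Record's bytes to a stream and read them back.
bool roundTrip(const Record& src, Record& dst)
{
    std::stringstream s;
    s.write(reinterpret_cast<const char*>(&src), sizeof src);
    assert(s.good());
    return readRecord(s, dst);
}
```

Here the object exists first and the bytes are written into it through a char pointer, which is the direction the aliasing rules permit.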
Aeriela answered 11/12, 2015 at 12:24 Comment(0)
C
1

I think there is a special exception to the strict aliasing rules for all the char types (signed, unsigned, and plain), so I think all you have to do is change the signature of magic_reinterpret to:

Data* magic_reinterpret(char *raw)

This doesn't work, I'm afraid. As commented by deviantfan, the exception is one-way: you can read (or write) a Data as a series of [unsigned] char, but you can't read or write a char array as a Data. The answer by Joachim is correct.
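To illustrate the permitted direction, here is a minimal sketch that moves bytes into and out of a real object with std::memcpy. Header, fromBytes and toBytes are hypothetical names used only for this example, and the sketch assumes the buffer holds bytes in the same layout the compiler uses for Header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical structure used only for this illustration.
struct Header
{
    std::int32_t id;
    std::uint16_t flags;
    std::int8_t kind;
};

static_assert(std::is_trivially_copyable<Header>::value,
              "memcpy into Header requires a trivially copyable type");

// Builds a real Header object from a char buffer by copying its bytes.
Header fromBytes(const char* raw, std::size_t size)
{
    assert(size >= sizeof(Header));
    Header h;
    std::memcpy(&h, raw, sizeof h);  // the Header's storage is written as bytes
    return h;
}

// The opposite copy: expose an existing Header's bytes in a char buffer.
void toBytes(const Header& h, char* out, std::size_t size)
{
    assert(size >= sizeof h);
    std::memcpy(out, &h, sizeof h);
}
```

In both functions an actual Header object exists and only its bytes are accessed through char; at no point is the char buffer itself accessed as if it were a Header.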

Having said all that, if you are reading from a network or a file, the extra overhead of reading your input as a series of octets and calculating the fields from the buffer is going to be negligible, and doing so will let you cope with changes in layout between compiler versions and machines.
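A rough sketch of that field-by-field approach. It assumes a packed little-endian wire format of 7 octets (4 + 2 + 1); Packet, decodePacket, loadU16le, loadU32le and kPacketWireSize are hypothetical names, and the real format has to come from whatever produced the data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assemble a 16-bit value from two octets, least significant first.
std::uint16_t loadU16le(const unsigned char* p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// Assemble a 32-bit value from four octets, least significant first.
std::uint32_t loadU32le(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Hypothetical in-memory form; its padding and byte order no longer matter.
struct Packet
{
    std::int32_t dword;
    std::uint16_t word;
    std::int8_t signedByte;
};

// Assumed size of one packet on the wire: 4 + 2 + 1 octets, no padding.
constexpr std::size_t kPacketWireSize = 7;

Packet decodePacket(const unsigned char* buf, std::size_t size)
{
    assert(size >= kPacketWireSize);
    Packet p;
    // The unsigned-to-signed conversions assume two's complement
    // (implementation-defined before C++20, guaranteed since).
    p.dword = static_cast<std::int32_t>(loadU32le(buf));
    p.word = loadU16le(buf + 4);
    p.signedByte = static_cast<std::int8_t>(buf[6]);
    return p;
}
```

Because every field is computed from individual octets, no object is ever accessed through a pointer of the wrong type, and the in-memory layout of Packet is independent of the file format.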

Crossexamine answered 11/12, 2015 at 12:25 Comment(4)
While char is indeed special, it's not like you think. It's not bidirectional. Btw., signed char is not ok, even if plain char acts as signed variable. - Engrossment
Damn! I answered 4 minutes ago, come back to fix it, and find someone's already commented. - Crossexamine
Would you like it more if you collect silent downvotes, without knowing what's wrong? ... SO has pretty many active users, in tags like C++ it doesn't take long to get reactions. - Engrossment
Absolutely not! Silent down votes strike me as anti-social. I was just grumbling about how little time I had to cover my tracks :-) - Crossexamine
