Why does fread mess with my byte order?

I'm trying to parse a BMP file with fread(), and when I begin to parse, it reverses the order of my bytes.

typedef struct{
    short magic_number;
    int file_size;
    short reserved_bytes[2];
    int data_offset;
}BMPHeader;
    ...
BMPHeader header;
    ...

The hex data is 42 4D 36 00 03 00 00 00 00 00 36 00 00 00. I am loading it into the struct with fread(&header,14,1,fileIn);

My problem is that the magic number should be 0x424d ('BM'), but after fread() the bytes are flipped and it is 0x4d42 ('MB').

Why does fread() do this, and how can I fix it?

EDIT: If I wasn't specific enough, I need to read the whole chunk of hex data into the struct not just the magic number. I only picked the magic number as an example.

Viera answered 19/12, 2011 at 3:53 Comment(3)
... bread messes with your bite order? Did you try nibbling? - Brazilin
Isn't that fread instead of bread for your title? - Intoxicated
Sorry. I still have to get used to Lion's autocorrect. I fixed it. - Viera

This is not the fault of fread, but of your CPU, which is (apparently) little-endian. That is, your CPU treats the first byte in a short value as the low 8 bits, rather than (as you seem to have expected) the high 8 bits.

Whenever you read a binary file format, you must explicitly convert from the file format's endianness to the CPU's native endianness. You do that with functions like these:

#include <stdint.h>

/* CHAR_BIT == 8 assumed */
uint16_t le16_to_cpu(const uint8_t *buf)
{
   return ((uint16_t)buf[0]) | (((uint16_t)buf[1]) << 8);
}
uint16_t be16_to_cpu(const uint8_t *buf)
{
   return ((uint16_t)buf[1]) | (((uint16_t)buf[0]) << 8);
}

You do your fread into a uint8_t buffer of the appropriate size, and then you manually copy all the data bytes over to your BMPHeader struct, converting as necessary. That would look something like this:

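#include <stdint.h>
#include <stdio.h>

/* note adjustments to type definition; le32_to_cpu is assumed to be
   written along the same lines as le16_to_cpu above */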
typedef struct BMPHeader
{
    uint8_t magic_number[2];
    uint32_t file_size;
    uint8_t reserved[4];
    uint32_t data_offset;
} BMPHeader;

/* in general this is _not_ equal to sizeof(BMPHeader) */
#define BMP_WIRE_HDR_LEN (2 + 4 + 4 + 4)

/* returns 0=success, -1=error */
int read_bmp_header(BMPHeader *hdr, FILE *fp)
{
    uint8_t buf[BMP_WIRE_HDR_LEN];

    if (fread(buf, 1, sizeof buf, fp) != sizeof buf)
        return -1;

    hdr->magic_number[0] = buf[0];
    hdr->magic_number[1] = buf[1];

    hdr->file_size = le32_to_cpu(buf+2);

    hdr->reserved[0] = buf[6];
    hdr->reserved[1] = buf[7];
    hdr->reserved[2] = buf[8];
    hdr->reserved[3] = buf[9];

    hdr->data_offset = le32_to_cpu(buf+10);

    return 0;
}

You do not assume that the CPU's endianness is the same as the file format's even if you know for a fact that right now they are the same; you write the conversions anyway, so that in the future your code will work without modification on a CPU with the opposite endianness.

You can make life easier for yourself by using the fixed-width <stdint.h> types, by using unsigned types unless being able to represent negative numbers is absolutely required, and by not using integers when character arrays will do. I've done all these things in the above example. You can see that you need not bother endian-converting the magic number, because the only thing you need to do with it is test magic_number[0]=='B' && magic_number[1]=='M'.

Conversion in the opposite direction, btw, looks like this:

void cpu_to_le16(uint8_t *buf, uint16_t val)
{
   buf[0] = (val & 0x00FF);
   buf[1] = (val & 0xFF00) >> 8;
}
void cpu_to_be16(uint8_t *buf, uint16_t val)
{
   buf[0] = (val & 0xFF00) >> 8;
   buf[1] = (val & 0x00FF);
}

Conversion of 32-/64-bit quantities left as an exercise.
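
For reference, here is a minimal sketch of what the 32-bit versions could look like. le32_to_cpu is the name the worked example above already calls; cpu_to_le32 is a hypothetical counterpart named by analogy with the 16-bit functions. As before, CHAR_BIT == 8 is assumed.

```c
#include <stdint.h>

/* Assemble a 32-bit value from four little-endian bytes.
   Works the same on any CPU, because only shifts are used. */
uint32_t le32_to_cpu(const uint8_t *buf)
{
    return  (uint32_t)buf[0]
         | ((uint32_t)buf[1] << 8)
         | ((uint32_t)buf[2] << 16)
         | ((uint32_t)buf[3] << 24);
}

/* Hypothetical counterpart: store a 32-bit value as four
   little-endian bytes. */
void cpu_to_le32(uint8_t *buf, uint32_t val)
{
    buf[0] = (uint8_t)(val & 0xFF);
    buf[1] = (uint8_t)((val >> 8) & 0xFF);
    buf[2] = (uint8_t)((val >> 16) & 0xFF);
    buf[3] = (uint8_t)((val >> 24) & 0xFF);
}
```

With the file-size bytes from the question, 36 00 03 00, le32_to_cpu returns 0x00030036, which is 196662.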

Idolatrize answered 19/12, 2011 at 4:9 Comment(6)
If you're going to use uint32_t file_size, the endianness is fixed at LE, so there's no reason not to use uint16_t magic_number. - Closestool
No, because you don't fread directly into the BMPHeader object. You fread into uint8_t buf[sizeof(BMPHeader)] and then you manually copy over each field, converting when appropriate; thus using a two-character string for the magic number avoids a conversion. Also I would argue that it is more natural to treat the "magic number" as a two-character string anyway (in this case). - Idolatrize
@Zack how would you copy the data in this case? - Viera
How do you know that you need to convert LE->BE if you don't look at magic_number to see whether it's 0x424D or 0x4D42? - Closestool
@ChaseWalden Worked example added. Note that the intermediate buffer also eliminates the problem you were having with structure padding. - Idolatrize
@Closestool You don't ask that question. You always convert, from the defined endianness of the file (LE in this case) to whatever the CPU wants. You don't need to know what endianness the CPU is to do the conversion -- my _to_cpu functions will work regardless. - Idolatrize

I assume this is an endianness issue: you are putting the bytes 42 and 4D into your short value, but your system is little-endian. A little-endian CPU treats the first byte in memory as the least significant byte of a multi-byte integer, so the bytes 42 4D are read as 0x4D42.

Demonstrated in this code:

#include <stdio.h>

int main(void)
{
    union {
        unsigned short sval;   /* unsigned, to match the %hu and %hx conversions */
        unsigned char bval[2];
    } udata;
    udata.sval = 1;
    printf( "DEC[%5hu]  HEX[%04hx]  BYTES[%02hhx][%02hhx]\n"
          , udata.sval, udata.sval, udata.bval[0], udata.bval[1] );
    udata.sval = 0x424d;
    printf( "DEC[%5hu]  HEX[%04hx]  BYTES[%02hhx][%02hhx]\n"
          , udata.sval, udata.sval, udata.bval[0], udata.bval[1] );
    udata.sval = 0x4d42;
    printf( "DEC[%5hu]  HEX[%04hx]  BYTES[%02hhx][%02hhx]\n"
          , udata.sval, udata.sval, udata.bval[0], udata.bval[1] );
    return 0;
}

Gives the following output on a little-endian machine:

DEC[    1]  HEX[0001]  BYTES[01][00]
DEC[16973]  HEX[424d]  BYTES[4d][42]
DEC[19778]  HEX[4d42]  BYTES[42][4d]

So if you want to be portable, you will need to detect the endianness of your system and then do a byte shuffle if required. There are plenty of examples around the internet of swapping the bytes around.
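
As a rough sketch of the detect-and-shuffle approach, something like the following could be used for 16-bit values. All three function names are hypothetical, chosen only for illustration, and CHAR_BIT == 8 is assumed. Note that reading the file into a byte buffer and assembling values with shifts avoids the detection step altogether.

```c
#include <stdint.h>

/* Returns 1 on a little-endian CPU, 0 otherwise.
   Looks at the first byte in memory of the value 1. */
int cpu_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Swap the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Interpret a 16-bit value that was read raw (e.g. with fread)
   from a little-endian file: swap only if the CPU disagrees. */
uint16_t le16_raw_to_cpu(uint16_t raw)
{
    return cpu_is_little_endian() ? raw : swap16(raw);
}
```

For example, swap16(0x424d) gives 0x4d42, which is the flip seen in the question.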

Subsequent question:

I ask only because my file size is 3 instead of 196662

This is due to memory alignment issues. 196662 is the bytes 36 00 03 00 and 3 is the bytes 03 00 00 00. Most systems need types like int not to be split across multiple memory words. So intuitively you think your struct is laid out in memory like:

                          Offset
short magic_number;       00 - 01
int file_size;            02 - 05
short reserved_bytes[2];  06 - 09
int data_offset;          0A - 0D

BUT on a 32-bit system that means file_size has two bytes in the same word as magic_number and two bytes in the next word. Most compilers will not stand for this, so the way the structure is laid out in memory is actually like:

short magic_number;       00 - 01
<<unused padding>>        02 - 03
int file_size;            04 - 07
short reserved_bytes[2];  08 - 0B
int data_offset;          0C - 0F

So when you read your byte stream in, the 36 00 goes into your padding area, which leaves your file_size getting the 03 00 00 00. Now, if you had used fwrite to create this data it would have been OK, as the padding bytes would have been written out. But if your input is always going to be in the format you have specified, it is not appropriate to read the whole struct in one go with fread. Instead you will need to read each of the elements individually.
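
The padding described above can be checked on your own compiler with offsetof and sizeof. This is only a sketch: the struct is copied from the question, print_bmp_header_layout is a hypothetical helper name, and the numbers printed depend on the platform, so none are shown here.

```c
#include <stddef.h>
#include <stdio.h>

/* The struct exactly as declared in the question. */
typedef struct {
    short magic_number;
    int file_size;
    short reserved_bytes[2];
    int data_offset;
} BMPHeader;

/* Print where the compiler actually placed each field. */
void print_bmp_header_layout(void)
{
    printf("magic_number   at offset %zu\n", offsetof(BMPHeader, magic_number));
    printf("file_size      at offset %zu\n", offsetof(BMPHeader, file_size));
    printf("reserved_bytes at offset %zu\n", offsetof(BMPHeader, reserved_bytes));
    printf("data_offset    at offset %zu\n", offsetof(BMPHeader, data_offset));
    printf("sizeof(BMPHeader) = %zu (the header in the file is 14 bytes)\n",
           sizeof(BMPHeader));
}
```

If file_size is reported at an offset larger than sizeof(short), the gap is the padding, and a 14-byte fread into the struct will put the file's bytes in the wrong fields.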

Muskrat answered 19/12, 2011 at 4:7 Comment(7)
Sorry, hit save too early. All there now. - Muskrat
+1 for demo, although it'd be nice to make the little-endian assumption here explicit. - Idolatrize
Does this only affect a short? I ask only because my file size is 3 instead of 196662. - Viera
No, it affects all integer types larger than 1 byte, so short, int, long, and long long. If you're using my code as a basis for debugging, you may need to remove/change the h characters in the printf formats. h is for shorts, hh is for unsigned char. Check man 3 printf for details. - Muskrat
@Muskrat I didn't use the h characters. I still get problems with the file_size. - Viera
@Chase Walden, added section to my answer about memory alignment, which is causing your incorrect file_size issue. - Muskrat
Thank you, I was extremely frustrated with the problems. - Viera

Writing a struct to a file is highly non-portable; it's safest to just not try to do it at all. Using a struct like this is guaranteed to work only if a) the struct is both written and read as a struct (never as a sequence of bytes) and b) it's always both written and read on the same (type of) machine. Not only are there "endian" issues with different CPUs (which is what it seems you've run into), there are also "alignment" issues. Different hardware implementations have different rules about placing integers only on even 2-byte or even 4-byte or even 8-byte boundaries. The compiler is fully aware of all this, and inserts hidden padding bytes into your struct so it always works right.

As a result of the hidden padding bytes, it's not at all safe to assume a struct's bytes are laid out in memory the way you think they are. If you're very lucky, you work on a computer whose byte order matches the file format's and which has no alignment restrictions at all, so you can lay structs directly over files and have it work. But you're probably not that lucky. Certainly programs that need to be "portable" to different machines have to avoid trying to lay structs directly over any part of any file.
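
On the writing side, one way to avoid laying a struct over the file is to encode each field into a byte buffer and write that. The sketch below assumes the 14-byte little-endian header layout from the question; put_le32, encode_bmp_header, write_bmp_header and BMP_FILE_HDR_LEN are hypothetical names used only for illustration, and CHAR_BIT == 8 is assumed.

```c
#include <stdint.h>
#include <stdio.h>

#define BMP_FILE_HDR_LEN 14  /* 2 + 4 + 4 + 4 bytes, as stored in the file */

/* Store a 32-bit value as four little-endian bytes. */
static void put_le32(uint8_t *buf, uint32_t val)
{
    buf[0] = (uint8_t)(val & 0xFF);
    buf[1] = (uint8_t)((val >> 8) & 0xFF);
    buf[2] = (uint8_t)((val >> 16) & 0xFF);
    buf[3] = (uint8_t)((val >> 24) & 0xFF);
}

/* Encode the header fields into buf, one byte at a time, so that
   neither struct padding nor CPU byte order can affect the result. */
void encode_bmp_header(uint8_t buf[BMP_FILE_HDR_LEN],
                       uint32_t file_size, uint32_t data_offset)
{
    buf[0] = 'B';
    buf[1] = 'M';
    put_le32(buf + 2, file_size);
    buf[6] = buf[7] = buf[8] = buf[9] = 0;  /* reserved bytes */
    put_le32(buf + 10, data_offset);
}

/* Returns 0 on success, -1 on error. */
int write_bmp_header(FILE *fp, uint32_t file_size, uint32_t data_offset)
{
    uint8_t buf[BMP_FILE_HDR_LEN];

    encode_bmp_header(buf, file_size, data_offset);
    return fwrite(buf, 1, sizeof buf, fp) == sizeof buf ? 0 : -1;
}
```

Calling encode_bmp_header(buf, 196662, 54) on an ASCII system produces the same 14 bytes quoted in the question: 42 4D 36 00 03 00 00 00 00 00 36 00 00 00.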

Waxbill answered 31/8, 2012 at 2:42 Comment(4)
Thank you for sharing your knowledge. This makes sense, and I will change the code in the future if I choose to make it more portable. - Viera
Blender 3D bases its entire file format on reading/writing structs to files, even managing pointers, endianness and 32/64-bit conversion. It's non-trivial, but I wouldn't say "don't do it at all". - Batton
@Batton I disagree completely. Properly reading/writing structs is non-trivial and easy to get wrong in subtle platform-specific ways (such as not being able to share files between machines). Writing platform-agnostic code to read/write the fields manually is trivial and hard to get wrong, not to mention that it will either work everywhere or nowhere. Reading/writing structs properly isn't that difficult, but it's certainly more difficult for no benefit. - Dietary
It's been working in Blender for 20+ years, giving very fast file I/O. I disagree that there is "no benefit": if you have many different structs (hundreds or more, which change as the software is improved), having to manually read/write them takes some effort to write and maintain. There are some constraints on structs (pointers/doubles need to be 8-byte aligned, even on 32-bit systems), but this can be checked and ensured to be portable. So while you do have a point, in practice it can be made to work quite well. For a single file header, I agree it's not worth doing. - Batton
