using memcpy to convert from array to int

I was experimenting with pointer manipulation and decided to try converting an array of numbers into an integer by directly copying from memory using memcpy.

char aux[4] = {1,2,3,4}; 
int aux2 = 0;
memcpy((char*) &aux2, &aux[0], 4);
printf("%X", aux2);

I expected the result to be 0x1020304 since I'm copying the exact bytes from one to another, but printf gives me the result 0x4030201, which is almost my desired output, only backwards. Why does this happen and is there a way to get the result in the "correct" order?

Killam answered 9/2, 2021 at 18:53 Comment(5)
Endianness. - Succursal
You expected wrong. Your CPU (ISA) uses a different order. - Vallation
You're on a little endian architecture, where the least significant bytes come first in memory (at lower addresses). - Wooded
%X is only for printing unsigned int -- you should make aux2 unsigned. - Catechu
I wrote FAQ answer about this the other week here: What is CPU endianness? - Dariusdarjeeling

Your code has at best implementation defined behavior and in some cases undefined behavior.

Type int may have a size different from 4: on 16-bit systems, int typically has a size of only 2 bytes. You would have undefined behavior on such systems.
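As a minimal sketch of how to guard against the size mismatch, one could use a fixed-width type and let sizeof drive the copy. The helper name copy_native_u32 is made up for illustration, and C11 is assumed for _Static_assert:

```c
#include <stdint.h>
#include <string.h>

/* Refuse to compile if uint32_t does not occupy exactly 4 bytes
   (this would happen on a platform where a byte is wider than 8 bits). */
_Static_assert(sizeof(uint32_t) == 4, "uint32_t must occupy 4 bytes");

/* Hypothetical helper: copies 4 bytes into a uint32_t in the host's
   native byte order. The size of the destination drives the copy. */
uint32_t copy_native_u32(const unsigned char bytes[4]) {
    uint32_t value = 0;
    memcpy(&value, bytes, sizeof value);
    return value;
}
```

The value returned still depends on the byte order of the machine; only the size problem is addressed here.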

On regular 32-bit systems, int has 4 bytes, but the order in which the 4 bytes are stored in memory is implementation defined, a problem referred to as endianness:

  • some systems use big-endian representation, where the first byte is the most significant part of the integer. Bytes 01 02 03 04 represent the value 0x01020304 on big-endian systems, such as older Macs, some mobile phones and embedded systems.

  • conversely, most personal computers today use little-endian representation, where the first byte contains the least significant part of the integer. Bytes 01 02 03 04 represent the value 0x04030201 on little-endian systems, such as yours.

  • The C Standard does not exclude other representations, where bytes would be in some other order. This was the case on some older DEC systems such as the PDP-11, on which the C language was originally developed (this order is called middle-endian or mixed-endian).
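The two main interpretations above can be written out with shifts, which do not depend on how the host stores integers. This is only a sketch: read_be32 and read_le32 are hypothetical helper names, and 8-bit bytes are assumed.

```c
#include <stdint.h>

/* Big-endian reading: bytes[0] is the most significant byte. */
uint32_t read_be32(const unsigned char bytes[4]) {
    return ((uint32_t)bytes[0] << 24) |
           ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |
            (uint32_t)bytes[3];
}

/* Little-endian reading: bytes[0] is the least significant byte. */
uint32_t read_le32(const unsigned char bytes[4]) {
    return  (uint32_t)bytes[0]        |
           ((uint32_t)bytes[1] << 8)  |
           ((uint32_t)bytes[2] << 16) |
           ((uint32_t)bytes[3] << 24);
}
```

With the bytes 01 02 03 04 from the question, read_be32 gives 0x01020304 and read_le32 gives 0x04030201 on any host, so the first one is a way to get the order the question calls "correct".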

Albeit surprising, the little-endian order is very logical, as the byte at offset n contains the bits with weights 2^(n*8) through 2^(n*8+7). Endianness is a cultural issue: both choices seem natural to long-time users.

The same variations are found in other contexts, such as the ordering of date components:

  • Japan uses big-endian representation: February 17 2021 is written 2021.02.17,

  • Europe uses little-endian representation: February 17 2021 is written 17/02/2021,

  • The USA uses a middle-endian representation: February 17 2021 is written 02/17/2021.

  • 21 is pronounced twenty-one in English (big-endian) whereas Germans say einundzwanzig (one and twenty: little-endian, and actually middle-endian for 3-digit numbers). But then 17 is seventeen (little-endian) and in French dix-sept (big-endian).

  • Western languages write numbers in big-endian format (I am 42 years old) but semitic scripts use little-endian order: Hebrew (אני בת 42) and Arabic (أنا ٤٢ سنة) both use little-endian as they are read from right to left.

Here is a more portable version to test memory representation:

#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned int aux2 = 0x01020304;
    unsigned char aux[sizeof(unsigned int)];
    /* copy the object representation of aux2 into the byte array */
    memcpy(aux, &aux2, sizeof(aux));
    printf("%X is represented in memory as", aux2);
    /* print the bytes in increasing address order */
    for (size_t i = 0; i < sizeof(aux); i++)
        printf(" %02X", aux[i]);
    printf("\n");
    return 0;
}
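If memcpy is kept, one possible approach is to detect the host byte order at run time and swap the bytes when needed. The sketch below makes several assumptions: the function names are invented for illustration, uint32_t is taken to occupy exactly 4 bytes, and only pure little-endian or pure big-endian hosts are handled (not mixed-endian ones).

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the least significant byte is stored at the lowest address. */
int host_is_little_endian(void) {
    uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Reverses the order of the four bytes of a 32-bit value. */
uint32_t swap_bytes32(uint32_t v) {
    return (v >> 24) |
           ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) |
           (v << 24);
}

/* Copies 4 bytes with memcpy, then swaps on little-endian hosts so that
   bytes[0] ends up as the most significant byte of the result. */
uint32_t memcpy_as_big_endian(const unsigned char bytes[4]) {
    uint32_t v;
    memcpy(&v, bytes, sizeof v);
    return host_is_little_endian() ? swap_bytes32(v) : v;
}
```

On both little-endian and big-endian hosts, memcpy_as_big_endian applied to the bytes 01 02 03 04 yields 0x01020304.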
Phosphoresce answered 9/2, 2021 at 20:46 Comment(7)
Nice answer. Detail: "English (big-endian)" --> English numbers have inconsistent endianness, as in 17 "seven-ten". - Rossiter
OK, so what endianness is the French pronunciation of "80"? ;-) - Guendolen
@AndrewHenle Or: 97: "quatre-vingt-dix-sept" --> 4*20 10 7. - Rossiter
"both use little-endian as they are read from right to left." --> Hmmm, I do not see endian as a left-right vs. right-left issue, but a "what is read/spoken/encoded first" issue. - Rossiter
@chux-ReinstateMonica: 97 is a good one :) still big-endian, but using base-20, a system with many examples in ancient and current history known as Vigesimal. - Phosphoresce
@chux-ReinstateMonica: I agree that the semitic script examples are somewhat misleading because when writing Hebrew, 42 is typed as 4 then 2 and is displayed as 42 because of the intrinsic left-to-right ordering of the corresponding code-points, the same applies to the Arabic numerals ٤٢. - Phosphoresce
Thank you, that does clear things up. I'm actually developing for a PIC microcontroller, and when I noticed this behavior I tried running similar code in an online compiler just to check, and when both had the same results I thought it was the expected behavior for any machine. Also, in my microcontroller I was using a uint32_t instead of int, just to be sure it had 4 bytes. But that was very helpful, thank you! - Killam
