I need to parse a binary file with 32-bit little-endian fields, and I want to write parsing code that produces correct results regardless of the endianness of the machine that executes it. Currently I use
#include <climits>   // CHAR_BIT
#include <cstdint>   // uint32_t

uint32_t fromLittleEndian(const char* data) {
    return uint32_t(data[3]) << (CHAR_BIT * 3) |
           uint32_t(data[2]) << (CHAR_BIT * 2) |
           uint32_t(data[1]) << CHAR_BIT |
           data[0];
}
This, however, generates suboptimal assembly. On my machine, g++ -O3 -S produces:
_Z16fromLittleEndianPKc:
.LFB4:
.cfi_startproc
movsbl 3(%rdi), %eax
sall $24, %eax
movl %eax, %edx
movsbl 2(%rdi), %eax
sall $16, %eax
orl %edx, %eax
movsbl (%rdi), %edx
orl %edx, %eax
movsbl 1(%rdi), %edx
sall $8, %edx
orl %edx, %eax
ret
.cfi_endproc
Why is this happening? How could I convince it to produce the optimal code when compiled on little-endian machines:
_Z17fromLittleEndian2PKc:
.LFB5:
.cfi_startproc
movl (%rdi), %eax
ret
.cfi_endproc
which I obtained by compiling:
uint32_t fromLittleEndian2(const char* data) {
    return *reinterpret_cast<const uint32_t*>(data);
}
Since I know my machine is little-endian, I know that the above assembly is optimal, but it will fail if compiled on a big-endian machine. It also violates strict-aliasing rules, so if inlined it might produce undefined behaviour even on little-endian machines. Is there valid code that will compile to the optimal assembly where possible?
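For illustration, here is a minimal sketch of one commonly suggested variant (the function name is made up, not from the question): assembling the value from unsigned char avoids both the strict-aliasing problem and the sign extension of a negative plain char, and whether it collapses into a single load is then purely an optimizer question. Recent GCC and Clang usually do pattern-match this into one 32-bit load on little-endian targets, but that is an observed optimization, not a guarantee.

#include <cstdint>

// Hypothetical portable variant: bytes are combined explicitly, so the result
// is the same on any host endianness, and accessing the buffer through
// unsigned char is allowed by the aliasing rules and never sign-extends.
uint32_t fromLittleEndianPortable(const char* data) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(data);
    return  uint32_t(p[0])
         | (uint32_t(p[1]) << 8)
         | (uint32_t(p[2]) << 16)
         | (uint32_t(p[3]) << 24);
}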
Since I expect my function to be inlined a lot, any kind of runtime endianness detection is out of the question. The only alternative to writing optimal C/C++ code is to use compile-time endianness detection, and use templates or #defines to fall back to the inefficient code if the target is not little-endian. This, however, seems quite difficult to do portably.
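For illustration, a minimal sketch of what such compile-time detection could look like on a GCC/Clang-compatible compiler; __BYTE_ORDER__, __ORDER_LITTLE_ENDIAN__ and __builtin_bswap32 are GNU extensions (also mentioned in the comments below), the function name is made up, and other compilers would need their own #ifdef branch.

#include <cstdint>
#include <cstring>

uint32_t fromLittleEndianDetected(const char* data) {
    uint32_t value;
    std::memcpy(&value, data, sizeof(value));  // defined behaviour, unlike the cast
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
    return value;                              // host is little-endian: nothing to do
#elif defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    return __builtin_bswap32(value);           // host is big-endian: swap the bytes
#else
    // Unknown byte order: fall back to explicit, endianness-independent shifts.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(data);
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) |
           (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
#endif
}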
Comments:

[…] reinterpret_cast. It isn't doing any byte reordering. If you have to dance the endian byte shuffle, you have to pay the band. – Deves

Won't htonl() be insignificant compared to the actual time you spend reading data from the HDD? – Crabby

fromLittleEndian works well, and is probably quicker than anything involving calling hton() and friends. And HDD throughput is likely to be much slower, I realise that. It's just that it's bugging me that I cannot get optimal assembly - this feels like something that should have been solved ages ago ;) – Hoes

[…] reinterpret_cast. Or do I expect too much from the optimizer? – Hoes

[…] the #defines provided by the compilers you are willing to support (and maybe some compiler intrinsic to swap the bytes); gcc provides __BYTE_ORDER__ and __bswap_32, and the other compilers will have something similar. Even better, you can just use Boost.Endian and delegate the problem of dealing with the various compilers to them. – Rhyolite

Re "probably quicker than anything involving calling hton": at least on gcc on Linux, I wouldn't bet on it; the code generated for htonl is probably optimal, the one with the naive shifts - I wouldn't say so. – Rhyolite

[…] compiles htonl into a single bswap instruction on x86. There's no beating that, no matter how many templates you try to throw at the problem. If you really want to, you can wrap the hton(x) functions in your own ones and provide inline optimized implementations for other not-so-smart compilers. – Twelfthtide

That's what <endian.h> is for (man 3 endian), though it's nonstandard. – Twelfthtide

There is a discussion of <endian.h> on the gcc mailing list: gcc.gnu.org/ml/gcc-help/2007-07/msg00342.html - it is a bit dated, but I don't think much has changed, and they don't even go as far as other compilers there. Since I already use autotools, I should probably just defer the problem to it. Still, it feels like going around the problem; the compiler alone should be able to deliver this info. – Hoes

[…] bswap or movbe. The most reliable way for compilers that support GNU C seems to be to use the htobe64 or htole32 functions that are wrappers around __builtin_bswap64 and similar. Portable fallbacks are possible for other compilers. – Dismember
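For illustration, a minimal sketch of the wrapper approach described in the last comments, assuming a platform whose nonstandard <endian.h> provides le32toh() (glibc does; see man 3 endian). The function name is made up.

#include <cstdint>
#include <cstring>
#include <endian.h>   // nonstandard header; glibc and several BSDs ship it

uint32_t loadLittleEndian32(const char* data) {
    uint32_t raw;
    std::memcpy(&raw, data, sizeof(raw));  // read the 4 bytes without aliasing issues
    return le32toh(raw);                   // no-op on little-endian hosts, bswap otherwise
}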