I need to parse a binary file with 32-bit little-endian fields, and I want to write parsing code that produces correct results regardless of the endianness of the machine that executes it. Currently I use
#include <climits>   // CHAR_BIT
#include <cstdint>   // uint32_t

uint32_t fromLittleEndian(const char* data) {
    return uint32_t(data[3]) << (CHAR_BIT * 3) |
           uint32_t(data[2]) << (CHAR_BIT * 2) |
           uint32_t(data[1]) << CHAR_BIT |
           data[0];
}
This, however, generates suboptimal assembly. On my machine, g++ -O3 -S produces:
_Z16fromLittleEndianPKc:
.LFB4:
.cfi_startproc
movsbl 3(%rdi), %eax
sall $24, %eax
movl %eax, %edx
movsbl 2(%rdi), %eax
sall $16, %eax
orl %edx, %eax
movsbl (%rdi), %edx
orl %edx, %eax
movsbl 1(%rdi), %edx
sall $8, %edx
orl %edx, %eax
ret
.cfi_endproc
Why is this happening? How could I convince it to produce the optimal code when compiled on little-endian machines:
_Z17fromLittleEndian2PKc:
.LFB5:
.cfi_startproc
movl (%rdi), %eax
ret
.cfi_endproc
which I obtained by compiling:
uint32_t fromLittleEndian2(const char* data) {
    return *reinterpret_cast<const uint32_t*>(data);
}
Since I know my machine is little-endian, I know that the above assembly is optimal, but it will fail if compiled on a big-endian machine. It also violates strict-aliasing rules, so if inlined it might produce undefined behaviour even on little-endian machines. Is there valid code that will compile to the optimal assembly where possible?
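For illustration, here is a minimal sketch of one commonly suggested variant (the function name is made up, not from the question): assembling the value from unsigned char avoids both the strict-aliasing problem and the sign extension of a negative plain char, and whether it collapses into a single load is then purely an optimizer question. Recent GCC and Clang usually do pattern-match this into one 32-bit load on little-endian targets, but that is an observed optimization, not a guarantee.

#include <cstdint>

// Hypothetical portable variant: bytes are combined explicitly, so the result
// is the same on any host endianness, and accessing the buffer through
// unsigned char is allowed by the aliasing rules and never sign-extends.
uint32_t fromLittleEndianPortable(const char* data) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(data);
    return  uint32_t(p[0])
         | (uint32_t(p[1]) << 8)
         | (uint32_t(p[2]) << 16)
         | (uint32_t(p[3]) << 24);
}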
Since I expect my function to be inlined a lot, any kind of runtime endianness detection is out of the question. The only alternative to writing optimal C/C++ code is to use compile-time endianness detection, and use templates or #defines to fall back to the inefficient code if the target is not little-endian. This, however, seems quite difficult to do portably.
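For illustration, a minimal sketch of what such compile-time detection could look like on a GCC/Clang-compatible compiler; __BYTE_ORDER__, __ORDER_LITTLE_ENDIAN__ and __builtin_bswap32 are GNU extensions (also mentioned in the comments below), the function name is made up, and other compilers would need their own #ifdef branch.

#include <cstdint>
#include <cstring>

uint32_t fromLittleEndianDetected(const char* data) {
    uint32_t value;
    std::memcpy(&value, data, sizeof(value));  // defined behaviour, unlike the cast
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
    return value;                              // host is little-endian: nothing to do
#elif defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    return __builtin_bswap32(value);           // host is big-endian: swap the bytes
#else
    // Unknown byte order: fall back to explicit, endianness-independent shifts.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(data);
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) |
           (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
#endif
}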
Comments:

[…] reinterpret_cast. It isn't doing any byte reordering. If you have to dance the endian byte shuffle, you have to pay the band. – Deves

Won't htonl() be insignificant compared to the actual time you spend reading data from the HDD? – Crabby

fromLittleEndian works well, and is probably quicker than anything involving calling hton() and friends. And HDD throughput is likely to be much slower, I realise that. It's just that it's bugging me that I cannot get optimal assembly - this feels like something that should have been solved ages ago ;) – Hoes

[…] reinterpret_cast. Or do I expect too much from the optimizer? – Hoes

[…] the #defines provided by the compilers you are willing to support (and maybe some compiler intrinsic to swap the bytes); gcc provides __BYTE_ORDER__ and __bswap_32, and the other compilers will have something similar. Even better, you can just use Boost.Endian and delegate the problem of dealing with the various compilers to them. – Rhyolite

Re "probably quicker than anything involving calling hton": at least on gcc on Linux, I wouldn't bet on it; the code generated for htonl is probably optimal, the one with the naive shifts - I wouldn't say so. – Rhyolite

[…] compiles htonl into a single bswap instruction on x86. There's no beating that, no matter how many templates you try to throw at the problem. If you really want to, you can wrap the hton(x) functions in your own ones and provide inline optimized implementations for other not-so-smart compilers. – Twelfthtide

That's what <endian.h> is for (man 3 endian), though it's nonstandard. – Twelfthtide

There is a discussion of <endian.h> on the gcc mailing list: gcc.gnu.org/ml/gcc-help/2007-07/msg00342.html - it is a bit dated, but I don't think much has changed, and they don't even go as far as other compilers there. Since I already use autotools, I should probably just defer the problem to it. Still, it feels like going around the problem; the compiler alone should be able to deliver this info. – Hoes

[…] bswap or movbe. The most reliable way for compilers that support GNU C seems to be to use the htobe64 or htole32 functions that are wrappers around __builtin_bswap64 and similar. Portable fallbacks are possible for other compilers. – Dismember
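For illustration, a minimal sketch of the wrapper approach described in the last comments, assuming a platform whose nonstandard <endian.h> provides le32toh() (glibc does; see man 3 endian). The function name is made up.

#include <cstdint>
#include <cstring>
#include <endian.h>   // nonstandard header; glibc and several BSDs ship it

uint32_t loadLittleEndian32(const char* data) {
    uint32_t raw;
    std::memcpy(&raw, data, sizeof(raw));  // read the 4 bytes without aliasing issues
    return le32toh(raw);                   // no-op on little-endian hosts, bswap otherwise
}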