What is the endianness of binary literals in C++14?
Asked Answered

I have tried searching around but have not been able to find much about binary literals and endianness. Are binary literals little-endian, big-endian or something else (such as matching the target platform)?

As an example, what is the decimal value of 0b0111? Is it 7? Platform specific? Something else? Edit: I picked a bad value of 7, since it fits within one byte. The question has been sufficiently answered despite this fact.

Some background: Basically I'm trying to figure out what the values of the least significant bits are, and masking them with binary literals seemed like a good way to go... but only if there is some guarantee about endianness.
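For concreteness, the sort of masking I have in mind looks like this (a minimal sketch; value is just a placeholder):

unsigned int value = 0b10110101;     // some input
unsigned int low3  = value & 0b0111; // intended: keep only the 3 least significant bits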

Peephole answered 18/12, 2014 at 16:21 Comment(12)
Binary literals work exactly the same way as decimal literals, except they are written in binary rather than decimal. They have no endianness.Scotland
I am genuinely curious: what are the down-votes and close-votes for? I am active on SO, but not the C++ community. What is bad about this question? It doesn't seem to be a duplicate, and it's a direct technical question. Can I get some further direction, please?Peephole
Endianness at the byte level has no meaning. Endianness only means something when you try to interpret a multibyte number (say, an int16) as a sequence of bytes. There is no way to do that to a single byte.Concern
@LeviMorrison You're asking for something that doesn't exist. Neither C++ nor C++11 has a notion of endianness; it's a machine architecture property.Anaya
There's nothing in particular wrong with the question. It seems to be more about a confusion of what endianness means (or possibly what number literals are), but I don't see how that's a problem.Scotland
@Cubic: Of course decimal literals have endianness. That's why 7x3 is 21 and not 12. Any ordered sequence of digits, regardless of base, has an endianness. Since the order can be ascending or descending, there's naturally big-endian and little-endian. ("middle-endian" being those weird 3412 unordered sequences)Rayburn
@Rayburn I was pretty sure that he was talking about memory layout, not about the actual literal syntax.Scotland
It would be a bit weird to talk about the memory layout of a literal, because it doesn't have one. Literals exist in early phases of compilation, while memory layout is a runtime thing (at best the code generation phase).Rayburn
My 2 cents on this: endianness is always byte-wise, not bit-wise. (0b0111 will be 7 on any platform that we currently have; maybe someone will invent something that reads bits in backwards order, who knows.) Second, code-wise a constant number will always be big-endian, unless you put it in a byte array and then cast it back as something else like an Int32, but the compiler will try its best to get you the right number.Ghats
C++11 does not have binary literals. C++14 does.Triennium
@LucasLocatelli: We are lucky and there are no machines where memory layout goes to the bit level, but it isn't impossible. Basically "memory layout" answers the question "what happens if I memcpy this thing into an array of unsigned char". If you memcpy a 32 bit unsigned int with value 1 into an array of four 8-bit unsigned chars, then in practice either the first or last byte will be 1 and all the others 0. But in theory, each of the 32 bits could be the one that is set. Old segmented pointers did have "interesting" memory layout.Facilitation
In addition I will say that even the compiler takes no care about this; for example, on the LLVM platform only the backend (technically not a compiler) will take care of endianness.Literate

Short answer: there isn't one. Write the number the way you would write it on paper.

Long answer: Endianness is never exposed directly in the code unless you really try to get it out (such as using pointer tricks). 0b0111 is 7; the same rules apply as for hex. Writing

int i = 0xAA77;

doesn't mean 0x77AA on some platforms, because that would be absurd. Where would the missing 0s go anyway with 32-bit ints? Would they get padded on the front and then the whole thing flipped to 0x77AA0000, or would they get added after? I have no idea what someone would expect if that were the case.

The point is that C++ doesn't make any assumptions about the endianness of the machine*. If you write code using the primitives and literals it provides, the behavior will be the same from machine to machine (unless you start circumventing the type system, which you may need to do).

To address your update: the number will be the way you write it out. The bits will not be reordered or any such thing, the most significant bit is on the left and the least significant bit is on the right.


There seems to be a misunderstanding here about what endianness is. Endianness refers to how bytes are ordered in memory and how they must be interpreted. If I gave you the number "4172" and said "if this is four-thousand one-hundred seventy-two, what is the endianness", you can't really give an answer because the question doesn't make sense. (Some argue that the largest digit on the left means big endian, but without memory addresses the question of endianness is not answerable or relevant.) This is just a number; there are no bytes to interpret and no memory addresses. Assuming a 4-byte integer representation, the bytes that correspond to it are:

        low address ----> high address
Big endian:    00 00 10 4c
Little endian: 4c 10 00 00

So, given either of those and told "this is the computer's internal representation of 4172", you could determine whether it's little or big endian.
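A minimal sketch of that experiment (assuming a 4-byte unsigned int; the bytes printed depend on the machine):

#include <cstdio>
#include <cstring>

int main() {
    unsigned int n = 4172;             // 0x0000104C
    unsigned char bytes[sizeof n];
    std::memcpy(bytes, &n, sizeof n);  // copy the object representation
    for (unsigned char b : bytes)
        std::printf("%02x ", (unsigned)b); // "4c 10 00 00" on a little-endian machine
    std::printf("\n");
}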

So now consider your binary literal 0b0111. These 4 bits represent one nybble and can be stored as either

              low ---> high
Big endian:    00 00 00 07
Little endian: 07 00 00 00

But you don't have to care, because this is handled for you: the language dictates that the compiler reads the literal from left to right, most significant bit to least significant bit.

Endianness is not about individual bits. Given that a byte is 8 bits, if I hand you 0b00000111 and say "is this little or big endian?", again you can't say, because you only have one byte (and no addresses). Endianness doesn't pertain to the order of bits in a byte; it refers to the ordering of entire bytes with respect to address (unless of course you have one-bit bytes).

You don't have to care about what your computer is using internally. 0b0111 just saves you from having to write stuff like

unsigned int mask = 7; // only keep the lowest 3 bits

by writing

unsigned int mask = 0b0111;

without needing a comment to explain the significance of the number.


* In C++20 you can check the endianness using std::endian.
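For example (C++20; a minimal sketch):

#include <bit>
#include <iostream>

int main() {
    if constexpr (std::endian::native == std::endian::little)
        std::cout << "little-endian\n";
    else if constexpr (std::endian::native == std::endian::big)
        std::cout << "big-endian\n";
    else
        std::cout << "mixed-endian\n";
}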

Wheatear answered 18/12, 2014 at 16:23 Comment(18)
@Jongware Well, you can use a union trick to find out the endianess.Anaya
@Jongware I was assuming we're talking about byte endianness here, which can be checked by casting to char* and then comparing.Scotland
@πάνταῥεῖ doing the union check would violate the rules on unions, you could do: int i = 1; char *cp = (char*)i; then *cp == 1 would be true if it's little endianWheatear
@Medinoc People generally should be writing endian-agnostic code anyway.Benadryl
I would like to point out that at a sufficiently low level of programming you cannot avoid endianness because the specifications of whatever you are implementing mandate their inputs or outputs to be in little/big/whatever endian. That includes network protocols, cryptographic algorithms, and so on. Just because you don't do these things doesn't mean they don't exist, and endianness does leak out of the nice comfy type system in these situations. So the "too clever for your own good" part seems unwarranted.Brandabrandais
@Medinoc As far as I've seen it done endianness is rarely checked at runtime anyway, it's usually assumed to be one or the other at build time by deducing it from the target architecture, and even then the OS tends to do the work for you in most cases (e.g. linux/bsd and their endian.h headers, OS X also has an analogous header, windows doesn't but windows is always little endian anyway)Brandabrandais
"one might argue"... well, 4172 is big-endian base 10 representation. Our number system and arithmetic tricks would work just as well if you write it as 2714 (and we know that this is taken to mean 2+70+100+4000). The more common examples of endianness in programming are the same thing but with base 256 instead of base 10.Eraser
@Brandabrandais revised with that in mind. I suppose someone has to implement htons after all.Wheatear
@MattMcNabb I haven't seen anything discussing endianness that wasn't directly tied to hardware, outside of the comparison drawn between big-endian and our conventional base-10 notation. I'm unconvinced that it's correct to say that "4172" is big-endian anything; afaict endianness is coupled with bytes inside of a machine. Excuse me if I'm too pedantic with the terminology.Wheatear
@RyanHaining Using the htons from your comment: that is easy to implement without making any assumptions about endianness: uint16_t htons(uint16_t x) { uint16_t result; unsigned char *p = (unsigned char *) &result; p[0] = x >> 8; p[1] = x; return result; } It does make some assumptions about the representation of uint16_t, but endianness is not one of those assumptions, and at least clang optimises this very well. I agree with the comment that people should generally be writing code that does not make assumptions about endianness, it is just not necessary.Wystand
@hvd I suppose I should say that while one might not need to know what endianness their internal representation uses, they can't be "endian-oblivious", since network programming generally requires that you know it exists, whether or not you know what format you're using. But yeah, I hadn't thought too much about the htons implementation when I wrote that, very good point.Wheatear
@Ryan, I think you meant *cp = (char*) &i;.Mannie
@Thomas: When you write to or read from an external format (whether a file or socket or whatever), the byte order of that format should be well defined, and yes, you will need to make sure that you write or read bytes in the correct order. However, that does not mean that your code needs to care what the native endianness is of the platform you're running on, hence you should not need to do compile-time or run-time endianness checks.Benadryl
@Benadryl How about when you read a raw byte array and need to parse them into an array of little-endian 64-bit integers because that's what the specification requires?Brandabrandais
@Brandabrandais Why would a specification make a requirement on how the integers are internally stored in your program? If the specification says that a file format stores an array of little-endian 64-bit integers, then you read them as little-endian 64-bit integers, but your program's internal representation of them shouldn't matter.Benadryl
@Benadryl Read my comment again: the program is simply passed a byte array (which has no endianness: it's a well-defined sequence of bytes) and needs to convert it into an array of 64-bit little-endian integers (perhaps to do operations on them, so that machines with different endianness produce the same result given the same byte array). You can do this task the slow, inefficient (but portable) way, by reading the bytes in one by one and building the 64-bit integers as you go, or you can do it the fast way, by blitting the 8-byte block into a 64-bit integer and byte-swapping if big endian.Brandabrandais
@Brandabrandais If you want to do operations on those integers, you should be converting them to the platform's native endianness, whichever it might be. And I'm just saying that the portable, endian-agnostic way to do it is an option. I acknowledge that it might not be the fastest solution, but I'd rather general advice be to write portable code first, then profile, and only then start caring about native endianness if you've truly determined that it actually matters.Benadryl
@Benadryl Integers don't have endianness. Their memory representation does... but the point is that you need to parse the byte array correctly to obtain the same integer representations from the bytes on systems with different endianness. And, yes, I certainly agree that taking advantage of endianness is usually an optimization, but in many cases it is such a low-level operation that it is done so frequently it would be a complete waste of time not to exploit your knowledge of the system's endianness (be it at compile-time or runtime). This happens all the time in crypto implementations.Brandabrandais
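A sketch of the portable, byte-by-byte parse being discussed in this thread (read_le64 is a hypothetical helper name):

#include <cstdint>

// Build a 64-bit value from 8 bytes stored least-significant first,
// without ever asking what the host's endianness is.
std::uint64_t read_le64(const unsigned char* p) {
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    return v;
}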

All integer literals, including binary ones, are interpreted in the same way as we normally read numbers (the leftmost digit being the most significant).

The C++ standard guarantees the same interpretation of literals without having to be concerned with the specific environment you're on. Thus, you don't have to concern yourself with endianness in this context.

Your example of 0b0111 is always equal to seven.

The C++ standard doesn't use terms of endianness in regards to number literals. Rather, it simply describes that literals have a consistent interpretation, and that the interpretation is the one you would expect.

C++ Standard - Integer Literals - 2.14.2 - paragraph 1

An integer literal is a sequence of digits that has no period or exponent part, with optional separating single quotes that are ignored when determining its value. An integer literal may have a prefix that specifies its base and a suffix that specifies its type. The lexically first digit of the sequence of digits is the most significant. A binary integer literal (base two) begins with 0b or 0B and consists of a sequence of binary digits. An octal integer literal (base eight) begins with the digit 0 and consists of a sequence of octal digits. A decimal integer literal (base ten) begins with a digit other than 0 and consists of a sequence of decimal digits. A hexadecimal integer literal (base sixteen) begins with 0x or 0X and consists of a sequence of hexadecimal digits, which include the decimal digits and the letters a through f and A through F with decimal values ten through fifteen. [Example: The number twelve can be written 12, 014, 0XC, or 0b1100. The literals 1048576, 1'048'576, 0X100000, 0x10'0000, and 0'004'000'000 all have the same value. — end example ]
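These equivalences hold at compile time, which a C++14 compiler will happily confirm:

static_assert(12 == 014 && 12 == 0XC && 12 == 0b1100, "twelve, four ways");
static_assert(1'048'576 == 0x10'0000 && 1'048'576 == 0'004'000'000, "separators are ignored");
static_assert(0b0111 == 7, "the literal from the question");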

Wikipedia describes what endianness is, and uses our number system as an example to understand big-endian.

The terms endian and endianness refer to the convention used to interpret the bytes making up a data word when those bytes are stored in computer memory.

Big-endian systems store the most significant byte of a word in the smallest address and the least significant byte is stored in the largest address (also see Most significant bit). Little-endian systems, in contrast, store the least significant byte in the smallest address.

An example on endianness is to think of how a decimal number is written and read in place-value notation. Assuming a writing system where numbers are written left to right, the leftmost position is analogous to the smallest address of memory used, and rightmost position the largest. For example, the number one hundred twenty three is written 1 2 3, with the hundreds place left-most. Anyone who reads this number also knows that the leftmost digit has the biggest place value. This is an example of a big-endian convention followed in daily life.

In this context, we are considering a digit of an integer literal to be a "byte of a word", and the word to be the literal itself. Also, the left-most character in a literal is considered to have the smallest address.

With the literal 1234, the digits one, two, three and four are the "bytes of a word", and 1234 is the "word". With the binary literal 0b0111, the digits zero, one, one and one are the "bytes of a word", and the word is 0111.

This consideration allows us to understand endianness in the context of the C++ language, and shows that integer literals are similar to "big-endian".

Higgler answered 18/12, 2014 at 16:34 Comment(9)
Big endian is the order which is readable to humans, because the big digits are encoded first. Little endian encodes the small digits first effectively reversing their order.Doorknob
Big endian = most significant byte first, little endian = least significant byte firstCretonne
That's the case for big endian systems.Diva
Please read en.wikipedia.org/wiki/Endianness . Quote: "Big-endian systems store the most significant byte of a word in the smallest address"Doorknob
@cmaster Smallest address = left = first. Of course we usually don't use the term endianness for number strings at all, and only for the layout in memory. So one can either say that the term "endianness" does not apply to literals at all, or that they're always big-endian. Saying that literals are always little endian is definitely wrong.Cretonne
@Cretonne Sorry, I misread your comment X-| I have corrected mine now.Doorknob
@cmaster That's not readability to humans. It's simply convention. Perhaps "readable for someone brought up in the larger current global civilization"Scotland
@Scotland generally when we say "human-readable" we mean a human from earth, yes.Wheatear
@RyanHaining That comment is nearly 5 years old, but if I recall correctly my objection at the time was the idea that there somehow was some sort of biological reason that'd be the case, rather than it being just another convention that may or may not be common.Scotland

You're missing the distinction between endianness as written in the source code and endianness as represented in the object code. The answer for each is unsurprising: source-code literals are big-endian because that's how humans read them; in object code they're written however the target reads them.

Since a byte is by definition the smallest unit of memory access, I don't believe it would be possible to even ascribe an endianness to any internal representation of the bits in a byte -- the only way to discover endianness for larger numbers (whether intentionally or by surprise) is by accessing them from storage piecewise, and the byte is by definition the smallest accessible storage unit.

Phosphoric answered 18/12, 2014 at 17:58 Comment(2)
In the sense of arithmetic operators, the abstract machine says the bits in an integral type are big-endian: right shifting a number produces something smaller. Of course, this has nothing to do with how bits or bytes are stored in memory devices.Poppycock
@Hurkyl exactly. You can't tell whether machine registers are big-endian or not because those are never exposed -- there's no reason at all to expose any endianness but big-endianness in registers, because the whole point of little-endian was compatibility with soda-straw 8-bit data buses to external storage or devices.Phosphoric

The C/C++ languages don't care about the endianness of multi-byte integers. C/C++ compilers do. Compilers parse your source code and generate machine code for the specific target platform. The compiler, in general, stores integer literals the same way it stores any integer, such that the target CPU's instructions will directly support reading and writing them in memory.

The compiler takes care of the differences between target platforms so you don't have to.

The only time you need to worry about endianness is when you are sharing binary values with other systems that have different byte ordering. Then you would read the binary data in, byte by byte, and arrange the bytes in memory in the correct order for the system that your code is running on.
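For instance, pulling a 16-bit big-endian ("network order") value out of a received buffer byte by byte works on any host (a minimal sketch):

#include <cstdint>

// Interpret two bytes received most-significant first, regardless of host order.
std::uint16_t read_be16(const unsigned char* buf) {
    return static_cast<std::uint16_t>((buf[0] << 8) | buf[1]);
}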

Crore answered 18/12, 2014 at 17:22 Comment(5)
You also need to worry about endianness if you manipulate data via char pointers.Doorknob
If the char pointer is pointing to an int, you can cast it to an int pointer and use it as such.Crore
@TheronWGenaux: Not always - it might not be guaranteed that the int is aligned correctly.Melise
@psmears: Very true. I remember (I think it was the 8086 processor) that alignment wasn't required. I was helping someone figure out why it was running so slow. We found that the stack was set to an odd address and it was doing 2 reads/writes for every push/pop on the stack.Crore
@TheronWGenaux: Haha, that one must have been fun to debug! Yes, the x86 processors default to simulating the unaligned read, which works (albeit slowly); the same code on another processor will generate a bus error. This is fun when you're coding and testing on x86, then deploying to a different (e.g. embedded) CPU...Melise

One picture is sometimes worth more than a thousand words.

[image: source vs. memory endianness]

Anyway answered 16/5, 2016 at 13:26 Comment(1)
Best Answer. Literals in C++ source are big endian, like we normally represent base 10 numbers in math. The memory ordering of the bytes will differ based on your hardware.Ravens

Endianness is implementation-defined. The standard guarantees that every object has an object representation as an array of char and unsigned char, which you can work with by calling memcpy() or memcmp(). In C++17, it is legal to reinterpret_cast a pointer or reference to any object type (not a pointer to void, pointer to a function, or nullptr) to a pointer to char, unsigned char, or std::byte, which are valid aliases for any object type.

What people mean when they talk about “endianness” is the order of bytes in that object representation. For example, if you declare unsigned char int_bytes[sizeof(int)] = {1}; and int i; then memcpy( &i, int_bytes, sizeof(i)); do you get 0x01, 0x01000000, 0x0100, 0x0100000000000000, or something else? The answer is: yes. There are real-world implementations that produce each of these results, and they all conform to the standard. The reason for this is so the compiler can use the native format of the CPU.
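That experiment, spelled out as a complete program (a minimal sketch; the value printed is whatever your platform produces):

#include <cstdio>
#include <cstring>

int main() {
    unsigned char int_bytes[sizeof(int)] = {1}; // first byte 1, the rest 0
    int i;
    std::memcpy(&i, int_bytes, sizeof i);
    std::printf("0x%X\n", (unsigned)i); // 0x1 if little-endian, 0x1000000 on a 32-bit big-endian machine
}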

This comes up most often when a program needs to send or receive data over the Internet, where all the standards define that data should be transmitted in big-endian order, on a little-endian CPU like the x86. Some network libraries therefore specify whether particular arguments and fields of structures should be stored in host or network byte order.

The language lets you shoot yourself in the foot by twiddling the bits of an object representation arbitrarily, but it might get you a trap representation, which could cause undefined behavior if you try to use it later. (This could mean, for example, rewriting a virtual function table to inject arbitrary code.) The <type_traits> header has several templates to test whether it is safe to do things with an object representation. You can copy one object over another of the same type with memcpy( &dest, &src, sizeof(dest) ) if that type is_trivially_copyable. You can make a copy to correctly-aligned uninitialized memory if it is_trivially_move_constructible. You can test whether two objects of the same type are identical with memcmp( &a, &b, sizeof(a) ) and correctly hash an object by applying a hash function to the bytes in its object representation if the type has_unique_object_representations. An integral type has no trap representations, and so on. For the most part, though, if you’re doing operations on object representations where endianness matters, you’re telling the compiler to assume you know what you’re doing and your code will not be portable.
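A sketch of how those traits gate byte-level operations (C++17; bitwise_equal is a hypothetical helper, not a standard facility):

#include <cstring>
#include <type_traits>

// Compare two objects via their object representations. The trait checks
// reject types whose padding bits would make memcmp give false negatives.
template <class T>
bool bitwise_equal(const T& a, const T& b) {
    static_assert(std::is_trivially_copyable<T>::value, "raw bytes must be meaningful");
    static_assert(std::has_unique_object_representations_v<T>,
                  "padding bits would make memcmp unreliable");
    return std::memcmp(&a, &b, sizeof(T)) == 0;
}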

As others have mentioned, binary literals are written with the most-significant digit first, like decimal, octal or hexadecimal literals. This is different from endianness and will not affect whether you need to call ntohs() on the port number from a TCP header read in from the Internet.

Shaughn answered 27/4, 2018 at 22:34 Comment(0)

You might want to think about C or C++ or any other language as being intrinsically little endian (think about how the bitwise operators work). If the underlying HW is big endian, the compiler ensures that the data is stored in big endian (ditto for other endiannesses), but your bitwise operations work as if the data were little endian. The thing to remember is that, as far as the language is concerned, data is little endian. Endianness-related problems arise when you cast the data from one type to another. As long as you don't do that, you are good.

I was questioned about the statement that the "C/C++ language is intrinsically little endian", so I am providing an example. Many know how it works, but here I go.

#include <stdio.h>

typedef union
{
    struct {
        unsigned int a:1;        /* unsigned: assigning 1 to a 1-bit signed field is implementation-defined */
        unsigned int reserved:31;
    } bits;

    unsigned int value;
} u;

int main(void)
{
    u test;
    test.bits.a = 1;
    test.bits.reserved = 0;

    /* reading .value after writing .bits is valid C; in C++ this type punning is technically undefined */
    printf("After bits assignment, test.value = 0x%08X\n", test.value);

    test.value = 0x00000001;

    printf("After value assignment, test.value = 0x%08X\n", test.value);
    return 0;
}

Output on a little endian system:

After bits assignment, test.value = 0x00000001
After value assignment, test.value = 0x00000001

Output on a big endian system:

After bits assignment, test.value = 0x80000000
After value assignment, test.value = 0x00000001

So, if you do not know the processor's endianness, where does everything come out right? In the little-endian system! Thus I say that the C/C++ language is intrinsically little endian.

Dallas answered 18/12, 2014 at 18:17 Comment(2)
Comments are not for extended discussion; this conversation has been moved to chat.Maryannmaryanna
One could write a similar check in an assembly language or any other language that has pointers. So this code only shows that "little-endian is more natural than big-endian"; this doesn't apply specifically to C/C++. Also, this has absolutely nothing to do about binary literals in the question.Coburn
