How do I organize members in a struct to waste the least space on alignment?
Asked Answered
K

7

61

[Not a duplicate of Structure padding and packing. That question is about how and when padding occurs. This one is about how to deal with it.]

I have just realized how much memory is wasted as a result of alignment in C++. Consider the following simple example:

struct X
{
    int a;
    double b;
    int c;
};

int main()
{
    cout << "sizeof(int) = "                      << sizeof(int)                      << '\n';
    cout << "sizeof(double) = "                   << sizeof(double)                   << '\n';
    cout << "2 * sizeof(int) + sizeof(double) = " << 2 * sizeof(int) + sizeof(double) << '\n';
    cout << "but sizeof(X) = "                    << sizeof(X)                        << '\n';
}

When using g++ the program gives the following output:

sizeof(int) = 4
sizeof(double) = 8
2 * sizeof(int) + sizeof(double) = 16
but sizeof(X) = 24

That's 50% memory overhead! In a 3-gigabyte array of 134'217'728 Xs, a full gigabyte would be pure padding.

Fortunately, the solution to the problem is very simple - we simply have to swap double b and int c around:

struct X
{
    int a;
    int c;
    double b;
};

Now the result is much more satisfying:

sizeof(int) = 4
sizeof(double) = 8
2 * sizeof(int) + sizeof(double) = 16
but sizeof(X) = 16

There is however a problem: this isn't cross-compatible. Yes, under g++ an int is 4 bytes and a double is 8 bytes, but that's not necessarily always true (their alignment doesn't have to be the same either), so under a different environment this "fix" could not only be useless, but it could also potentially make things worse by increasing the amount of padding needed.

Is there a reliable cross-platform way to solve this problem (minimize the amount of needed padding without suffering from decreased performance caused by misalignment)? Why doesn't the compiler perform such optimizations (swap struct/class members around to decrease padding)?

Clarification

Due to misunderstanding and confusion, I'd like to emphasize that I don't want to "pack" my struct. That is, I don't want its members to be unaligned and thus slower to access. Instead, I still want all members to be self-aligned, but in a way that uses the least memory on padding. This could be solved by using, for example, manual rearrangement as described here and in The Lost Art of Packing by Eric Raymond. I am looking for an automated and as much cross-platform as possible way to do this, similar to what is described in proposal P1112 for the upcoming C++20 standard.

Kaitlin answered 25/6, 2019 at 20:29 Comment(15)
If you need "arrays" of hundreds of millions of elements, then perhaps arrays are not the correct data structure to begin with? At least not in-memory arrays (think memory-mapped files, or perhaps even some kind of database)?Immoral
And really, the only possible answer to the question "[i]s there a reliable cross-platform way to solve this problem (minimize the amount of needed padding without suffering from decreased performance caused by misalignment)?" could only be a simple "no". There are probably compiler and system specific ways to work around it, but nothing truly portable or compiler/platform/system agnostic.Immoral
May be some portability benefits from using fixed width integers so they don't change size on you.Scrivner
And regarding "[w]hy doesn't the compiler perform such optimizations (swap struct/class members around to decrease padding)?" How could the compiler do that when it can't tell what the structure is used for? Perhaps it will be stored raw in a binary file, or sent over a serial communication protocol (in which case unpacked structures (manually or by compiler pragma) are really a bad idea, but it still happens).Immoral
largest alignment requirements first. If none, then largest members first. Regarding your real question, yes there is a cross-compatible method for doing this: it's called a string. Outside of that, types using specified bit widths can help significantly, but still require endian handling if you're really serious about cross platform. In short, protocols exist specifically to address such issues and bridge the hard differences between platforms. Things like this are one of many reasons why they exist, Caveat: Good chance I completely misunderstood the "this" of this question.Recapitulate
Lastly, this feels very much like an XY problem to me. Rearranging structures is a solution, but what is the real problem behind this solution? What are you really trying to accomplish? Why do you need arrays of millions of structures? Perhaps there are other possible solutions to that original problem, ones that doesn't involve arrays or which makes possible padding irrelevant?Immoral
For all the reasons above, there is no one thing that guarantees a minimum storage for struct size, but @Recapitulate provides a precise explanation of the oversimplified rule Biggest First, Smallest Last in decreasing order of storage size required. That's about as reasonable an approach likely to minimize storage across compilers and hardware, but there is no guarantee any two structs will be allocated the same amount of storage between compilers (other than trivial examples (such as struct foo { int a, b; };)Doviedow
@Someprogrammerdude Why do you need arrays of millions of structures? I believe, in HPC it's quite common. For instance, we work with very large sparse matrices. Our typical workflow is to generate matrix elements and then to convert them into a storage format efficient for further processing. This conversion typically involves sorting. Unfortunately, C++ doesn't support sorting of multiple arrays at once, therefore, we sort them in the form of an array of structs, each having a row/column index and a value. We can work even with billions of matrix elements in a single MPI process.Viehmann
Your description of "not packing the struct" sounds exactly like packing the struct.Priam
"under g++ an int is 4 bytes and a double is 8 bytes". Well, on an Arduino (the underlying compiler is GCC, used as a C++ compiler), a double is the same as a float (4 bytes), which may come as a surprise for some (especially if more than 7-8 significant digits are required, say, for a frequency counter...).Ceporah
Possible duplicate of Structure padding and packingRaisin
@chrylis doesn't "packing the struct" entail unaligned access? There's a middle way where you reorder the elements.Groenendael
@Groenendael Not necessarily. In particular, it's common for alignment to be something along the lines of "larger of word or operand size", meaning that (int, int, double) is naturally aligned without padding.Priam
If you found this question useful then here are some other ways in which you can optimize your code on the low level.Kaitlin
@DanielLangr I asked because I wanted the OP to elaborate about the real problem instead of how to fix a solution to an unknown (for us) problem.Immoral
C
39

(Don't apply these rules without thinking. See ESR's point about cache locality for members you use together. And in multi-threaded programs, beware false sharing of members written by different threads. Generally you don't want per-thread data in a single struct at all for this reason, unless you're doing it to control the separation with a large alignas(128). This applies to atomic and non-atomic vars; what matters is threads writing to cache lines regardless of how they do it.)
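A minimal sketch of that false-sharing point, assuming a 64-byte cache line (the common case on modern x86; where implemented, C++17's std::hardware_destructive_interference_size gives a portable value):

```cpp
#include <atomic>

// Assumes a 64-byte cache line; this is not guaranteed by the standard.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
    // alignas(64) rounds sizeof up to 64, so adjacent array elements
    // land on different cache lines and threads don't false-share.
};

PaddedCounter per_thread[4];  // one counter per thread, one line per counter

static_assert(alignof(PaddedCounter) == 64, "each counter starts a line");
static_assert(sizeof(PaddedCounter) == 64, "tail-padded to a full line");
```

Note this is the opposite trade from the rest of this answer: here you deliberately spend padding to buy multi-threaded throughput.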


Rule of thumb: largest to smallest alignof(). There's nothing you can do that's perfect everywhere, but by far the most common case these days is a sane "normal" C++ implementation for a normal 32 or 64-bit CPU. All primitive types have power-of-2 sizes.

Most types have alignof(T) = sizeof(T), or alignof(T) capped at the register width of the implementation. So larger types are usually more-aligned than smaller types.

Struct-packing rules in most ABIs give struct members their absolute alignof(T) alignment relative to the start of the struct, and the struct itself inherits the largest alignof() of any of its members.
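Both rules are easy to check directly; the numbers below assume a typical 64-bit implementation where sizeof(double) == alignof(double) == 8:

```cpp
struct Y {
    char   c;  // offset 0
    double d;  // offset 8: placed at the next multiple of alignof(double)
};

// The struct inherits the strictest alignment of any member...
static_assert(alignof(Y) == alignof(double));
// ...so 7 bytes of padding sit between c and d (16 bytes total, not 9).
static_assert(sizeof(Y) == 2 * sizeof(double));
```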

  • Put always-64-bit members first (like double, long long, and int64_t). ISO C++ of course doesn't fix these types at 64 bits / 8 bytes, but in practice on all CPUs you care about they are. People porting your code to exotic CPUs can tweak struct layouts to optimize if necessary.

  • then pointers and pointer-width integers: size_t, intptr_t, and ptrdiff_t (which may be 32 or 64-bit). These are all the same width on normal modern C++ implementations for CPUs with a flat memory model.

    Consider putting linked-list and tree left/right pointers first if you care about x86 and Intel CPUs. Pointer-chasing through nodes in a tree or linked list has penalties when the struct start address is in a different 4k page than the member you're accessing. Putting them first guarantees that can't be the case.

  • then long (which is sometimes 32-bit even when pointers are 64-bit, in LLP64 ABIs like Windows x64). But it's guaranteed at least as wide as int.

  • then 32-bit int32_t, int, float, enum. (Optionally separate int32_t and float ahead of int if you care about possible 8 / 16-bit systems that still pad those types to 32-bit, or do better with them naturally aligned. Most such systems don't have wider loads (FPU or SIMD) so wider types have to be handled as multiple separate chunks all the time anyway).

    ISO C++ allows int to be as narrow as 16 bits, or arbitrarily wide, but in practice it's a 32-bit type even on 64-bit CPUs. ABI designers found that programs designed to work with 32-bit int just waste memory (and cache footprint) if int was wider. Don't make assumptions that would cause correctness problems, but for "portable performance" you just have to be right in the normal case.

    People tuning your code for exotic platforms can tweak if necessary. If a certain struct layout is perf-critical, perhaps comment on your assumptions and reasoning in the header.

  • then short / int16_t

  • then char / int8_t / bool

  • (for multiple bool flags, especially if read-mostly or if they're all modified together, consider packing them with 1-bit bitfields.)

(For unsigned integer types, find the corresponding signed type in my list.)
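The bit-field suggestion from the list, sketched out (bit-field layout is implementation-defined, but mainstream ABIs pack these into a single int-sized allocation unit):

```cpp
struct Flags {
    unsigned dirty   : 1;
    unsigned visible : 1;
    unsigned locked  : 1;
    unsigned cached  : 1;
    // Up to 32 one-bit flags fit in a single unsigned int,
    // versus one byte each for separate bool members.
};

static_assert(sizeof(Flags) == sizeof(unsigned int));
```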

A multiple-of-8 byte array of narrower types can go earlier if you want it to. But if you don't know the exact sizes of types, you can't guarantee that int i + char buf[4] will fill an 8-byte aligned slot between two doubles. But it's not a bad assumption, so I'd do it anyway if there was some reason (like spatial locality of members accessed together) for putting them together instead of at the end.
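Putting the rules above together, a sketch of the payoff; the exact sizes in the comments assume the x86-64 System V ABI:

```cpp
#include <cstdint>

struct Ordered {   // members sorted by decreasing alignof()
    double   d;    // 8-byte aligned
    void*    p;    // 8 bytes on x86-64
    int32_t  i;    // 4
    int16_t  s;    // 2
    char     c;    // 1
    bool     b;    // 1
};                 // 8+8+4+2+1+1 = 24 bytes: no padding at all

struct Shuffled {  // same members, careless order
    char     c;    // 1, then 7 bytes of padding before d
    double   d;
    int16_t  s;    // 2, then 6 bytes of padding before p
    void*    p;
    bool     b;    // 1, then 3 bytes of padding before i
    int32_t  i;
};                 // 40 bytes: 16 of them pure padding

static_assert(sizeof(Ordered) == 24);
static_assert(sizeof(Shuffled) == 40);
```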

Exotic types: x86-64 System V has alignof(long double) = 16, but i386 System V has only alignof(long double) = 4, sizeof(long double) = 12. It's the x87 80-bit type, which is actually 10 bytes but padded to 12 or 16 so it's a multiple of its alignof, making arrays possible without violating the alignment guarantee.

And in general it gets trickier when your struct members themselves are aggregates (struct or union) with a sizeof(x) != alignof(x).
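One concrete case of that (sizes again assume a typical 64-bit ABI): a struct member always occupies its full sizeof, so a following member can never be placed in its tail padding:

```cpp
struct Inner {
    double d;   // 8 bytes
    char   c;   // 1 byte + 7 bytes tail padding
};              // sizeof 16, alignof 8

struct Outer {
    Inner in;   // occupies all 16 bytes, tail padding included
    char  c2;   // cannot slot into Inner's tail padding
};              // sizeof 24, not 17

static_assert(sizeof(Inner) == 16);
static_assert(sizeof(Outer) == 24);
```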

Another twist is that in some ABIs (e.g. 32-bit Windows if I recall correctly) struct members are aligned to their size (up to 8 bytes) relative to the start of the struct, even though alignof(T) is still only 4 for double and int64_t.
This is to optimize for the common case of separately allocating 8-byte-aligned memory for a single struct, without giving an alignment guarantee. i386 System V also has alignof(T) = 4 for most primitive types (but malloc still gives you 8-byte-aligned memory because alignof(max_align_t) = 8). But anyway, i386 System V doesn't have that struct-packing rule, so (if you don't arrange your struct from largest to smallest) you can end up with 8-byte members under-aligned relative to the start of the struct.


Most CPUs have addressing modes that, given a pointer in a register, allow access to any byte offset. The max offset is usually very large, but on x86 it saves code size if the byte offset fits in a signed byte ([-128 .. +127]). So if you have a large array of any kind, prefer putting it later in the struct after the frequently used members. Even if this costs a bit of padding.

Your compiler will pretty much always make code that has the struct address in a register, not some address in the middle of the struct to take advantage of short negative displacements.


Eric S. Raymond wrote an article The Lost Art of Structure Packing. Specifically the section on Structure reordering is basically an answer to this question.

He also makes another important point:

9. Readability and cache locality

While reordering by size is the simplest way to eliminate slop, it’s not necessarily the right thing. There are two more issues: readability and cache locality.

In a large struct that can easily be split across a cache-line boundary, it makes sense to put two things nearby if they're always used together. Or even contiguous to allow load/store coalescing, e.g. copying 8 or 16 bytes with one (unaligned) integer or SIMD load/store instead of separately loading smaller members.

Cache lines are typically 32 or 64 bytes on modern CPUs. (On modern x86, always 64 bytes. And Sandybridge-family has an adjacent-line spatial prefetcher in L2 cache that tries to complete 128-byte pairs of lines, separate from the main L2 streamer HW prefetch pattern detector and L1d prefetching).


Fun fact: Rust allows the compiler to reorder structs for better packing, or other reasons. IDK if any compilers actually do that, though. Probably only possible with link-time whole-program optimization if you want the choice to be based on how the struct is actually used. Otherwise separately-compiled parts of the program couldn't agree on a layout.


(@alexis posted a link-only answer linking to ESR's article, so thanks for that starting point.)

Contorted answered 26/6, 2019 at 20:11 Comment(5)
Although this isn't really a fully cross-platform solution, nor an automated one, it contains the most concrete information about how to tackle this problem, so I will accept it. Maybe I will later make a community wiki here instead.Kaitlin
@YanB.: I didn't fully read the question before answering; I didn't realize you were mostly looking for automated solutions instead of a rule-of-thumb. But fortunately there's enough similarity between all the modern mainstream 32 and 64-bit CPUs that we actually care about that we can give useful advice despite the fact that ISO C++ guarantees basically nothing. There's a large collection of "assumptions" about what's normal that go with C++ (and modern CPUs), separate from the ISO C++ standard. Much of this is nearly required for a C++ implementation to be useful for anything in practice!Contorted
Smaller-to-larger sort order is perhaps better overall: it results in more efficient access to most members (e.g. due to the offset being smaller, as you point out, but also because more members of the struct tend to fall within a cache line). The main downside is that padding holes are more likely to appear in the middle of the structure rather than at the end, so copying may be less efficient in some unusual cases.Kingpin
@BeeOnRope: especially with gcc missed-optimizations. GCC8's store coalescing for struct zeroing refuses to overwrite padding: gcc.gnu.org/bugzilla/show_bug.cgi?id=82142Contorted
It doesn't seem to be a universal problem. See my quick test.Kingpin
G
32

gcc has the -Wpadded warning that warns when padding is added to a structure:

https://godbolt.org/z/iwO5Q3:

<source>:4:12: warning: padding struct to align 'X::b' [-Wpadded]
    4 |     double b;
      |            ^

<source>:1:8: warning: padding struct size to alignment boundary [-Wpadded]
    1 | struct X
      |        ^

And you can manually rearrange members so that there is less / no padding. But this is not a cross-platform solution, as different types can have different sizes / alignments on different systems (most notably pointers being 4 or 8 bytes on different architectures). The general rule of thumb is to go from largest to smallest alignment when declaring members, and if you're still worried, compile your code with -Wpadded once (but I wouldn't keep it on generally, because padding is necessary sometimes).

As for why the compiler can't do it automatically: the standard ([class.mem]/19) guarantees that, because this is a simple struct with only public members, &x.a < &x.c (for some X x;), so the members can't be rearranged.
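That ordering guarantee is observable with offsetof; the exact offsets in the comments assume g++ on x86-64, but the ordering comparisons must hold on any conforming implementation for a standard-layout struct like this:

```cpp
#include <cstddef>  // offsetof

struct X {
    int    a;  // offset 0
    double b;  // offset 8 on x86-64
    int    c;  // offset 16 on x86-64
};

// [class.mem]: members with the same access control are allocated so
// that later members have higher addresses, hence these must hold.
static_assert(offsetof(X, a) < offsetof(X, b));
static_assert(offsetof(X, b) < offsetof(X, c));
```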

Guggle answered 25/6, 2019 at 20:48 Comment(4)
I honestly didn't think I'd see something useful come out of this question. Wasn't aware of that gcc option (and now I'm hoping clang has it as well). Thanks for teaching me something. tick.Recapitulate
@Recapitulate Yes, clang also has this option (it even has the same name). It's very helpful (at least for me) when dealing with the "rearrangement problem". It's a bummer that (at least for now) I haven't found an automated solution.Kaitlin
Are there any remotely-modern platforms on which placing types in the order double, [unsigned] long long, int64_t, pointers, long, float, int32_t, int, int16_t, short, char would not yield optimal alignment?Turnbull
Your links to eel.is expire as the draft standard evolves. It would be useful to link to a finalized document. See here. Example: [class.mem]/19.Caroncarotene
D
14

There really isn't a portable solution in the generic case. Beyond the minimal requirements the standard imposes, types can be any size the implementation wants to make them.

To go along with that, the compiler is not allowed to reorder class members to make the layout more efficient. The standard mandates that the objects be laid out in their declaration order (within each access-specifier group), so that's out as well.

You can use fixed width types like

struct foo
{
    int64_t a;
    int16_t b;
    int8_t c;
    int8_t d;
};

and this will be the same on all platforms, provided they supply those types, but it only works with integer types. There are no fixed-width floating point types and many standard objects/containers can be different sizes on different platforms.
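One caveat worth pinning down with a static_assert: even with fixed-width members, the total size still varies, because the struct is padded out to a multiple of its alignment, and alignof(int64_t) itself differs between ABIs (4 on i386 System V, 8 on x86-64). A sketch assuming a typical 64-bit target:

```cpp
#include <cstdint>

struct foo {
    int64_t a;  // offset 0
    int16_t b;  // offset 8
    int8_t  c;  // offset 10
    int8_t  d;  // offset 11
};
// 12 bytes of data, but on a 64-bit ABI the struct is rounded up
// to a multiple of alignof(int64_t) == 8, giving 4 bytes of tail padding:
static_assert(sizeof(foo) == 16);
```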

Disconcert answered 25/6, 2019 at 20:50 Comment(6)
Adding salt to the wound, floating point types are frequently hyper sensitive to bus-alignment positions, thereby enhancing the no-silver-bullet mantra. Regardless, this is very useful when loading up structs with anything other than floating point and potentially pointers. I use it frequently.Recapitulate
Why isn’t member rearrangement allowed? Could you clarify?Kaitlin
If you take the cross-platform portability to the limit, note that these "exact width" types are optional. Every platform must have int_least16_t and int_fast16_t, but (for example if CHAR_BIT != 8), int16_t need not exist on a given platform.Mortal
@Mortal While they are optional, the code will fail to compile if they are not present, so at least you won't get a binary that blows up on you.Disconcert
You can store a float in a 4 byte int. Just the reading and writing is ugly.Sitka
@YanB. Because the standard says so. Also see #118568. As for the rationale, lots of things would be broken if compilers were free to do that (among other things, imagine a program that writes structs directly to files with fwrite and reads them back with fread; changes to the compiler could suddenly break file format compatibility for compiled programs).Cioffi
L
5

Mate, in case you have 3 GB of data, you should probably approach the issue some other way than by swapping data members around.

Instead of using an 'array of structs', a 'struct of arrays' can be used. So, say,

struct X
{
    int a;
    double b;
    int c;
};

constexpr size_t ArraySize = 1'000'000;
X my_data[ArraySize];

becomes

constexpr size_t ArraySize = 1'000'000;
struct X
{
    int    a[ArraySize];
    double b[ArraySize];
    int    c[ArraySize];
};

X my_data;

Each element is still easily accessible: my_data.a[i] = 5; my_data.b[i] = 1.5; ....
There is no padding (except a few bytes between the arrays). The memory layout is cache-friendly, and the prefetcher handles reading sequential memory blocks from a few separate memory regions.

That's not as unorthodox as it might look at first glance. This approach is widely used for SIMD and GPU programming.
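The SIMD-friendliness is easy to see in a sketch: a pass over a single field now walks one contiguous stream that auto-vectorizers handle well (the helper function below is illustrative, not part of the answer above):

```cpp
#include <cstddef>

constexpr std::size_t ArraySize = 1'000'000;

struct X {  // structure-of-arrays layout from above
    int    a[ArraySize];
    double b[ArraySize];
    int    c[ArraySize];
};

// Scaling every b touches one dense, cache-friendly stream. With the
// array-of-structs layout the same loop would stride over 24-byte
// records, dragging a and c through the cache for no benefit.
void scale_b(X& data, double factor) {
    for (std::size_t i = 0; i < ArraySize; ++i)
        data.b[i] *= factor;
}
```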


Array of Structures (AoS) and Structure of Arrays (SoA)

Linares answered 28/6, 2019 at 2:6 Comment(1)
This is much better when SIMD is possible. But when you need scattered / random access to structs (and need multiple members of the same struct, but not anything from nearby structs) SoA costs you 3x the cache misses. It also costs you more pointers/registers, especially for non-CISC and/or non-static allocation. But if SIMD is an option for any of your loops then yes it's usually much better to have SoA.Contorted
A
4

This is a textbook memory-vs-speed problem. The padding is to trade memory for speed. You can't say:

I don't want to "pack" my struct.

because pragma pack is the tool invented exactly to make this trade the other way: speed for memory.

Is there a reliable cross-platform way

No, there can't be any. Alignment is a strictly platform-dependent issue. The sizes of the different types are a platform-dependent issue. Avoiding padding by reorganizing is platform-dependent squared.

Speed, memory, and cross-platform - you can have only two.

Why doesn't the compiler perform such optimizations (swap struct/class members around to decrease padding)?

Because the C++ specifications specifically guarantee that the compiler won't mess up your meticulously organized structs. Imagine you have four floats in a row. Sometimes you use them by name, and sometimes you pass them to a method that takes a float[3] parameter.

You're proposing that the compiler should shuffle them around, potentially breaking all the code written since the 1970s. And for what reason? Can you guarantee that every programmer ever will actually want to save your 8 bytes per struct? I, for one, am sure that if I have a 3 GB array, I have bigger problems than a gigabyte more or less.

Aftonag answered 26/6, 2019 at 9:49 Comment(15)
I'd argue that the only issue here is the "sometimes you pass them to a method that takes a float[3] parameter". Well, that's a pretty special use-case. In fact I'd say it's the main problem here that C++ supports this kind of pointer juggling; if it didn't, and instead allowed the compiler to always reorder for optimisation, then lots of code would run faster, whilst the programs that would need to be rewritten to wrap that float[3] explicitly in an array would suffer a negligible performance penalty.Nantucket
I'm pretty sure that type-punning four individual floating point member-variables to pass them as a float[3] invokes undefined behavior.Haemachrome
@JeremyFriesner: Note that "Undefined Behavior" was intended to allow implementations that could offer more useful semantics to do so when practical, before language vandals took over and started using it as an excuse not to offer useful semantics even in cases where they would have cost nothing.Turnbull
@Turnbull regardless of historical intent, invoking undefined behavior isn't something one wants to do (unless you enjoy discovering and diagnosing obscure runtime misbehaviors)Haemachrome
@JeremyFriesner: The Standard has never sought to require that implementations support all of the semantics necessary for any particular purpose. On many target platforms, I/O would be impossible without using pointers to represent addresses that do not identify "objects" as the Standard defines the term. If one weren't allowed to perform actions upon which the Standard imposes no requirements, one wouldn't be able to do anything on such platforms.Turnbull
@JeremyFriesner: To be sure, one would be asking for trouble if one tried to use such low-level programming techniques on an implementation that was not designed or configured to be suitable for such purposes, but using an implementation that's unsuitable for any particular job one is trying to do would be asking for trouble.Turnbull
@Turnbull actually it was not "language vandals taking over", it was compiler writers that were able to squeeze out some more optimization opportunities by taking "undefined behaviour" literally. Essentially, you wish the compiler would do something sane, while the compiler writer prefers to do something fast (because that improves the benchmarks, which in turn improve sales/mindshare, and actually improve runtime speed even for pretty normal programs).Bandeau
@toolforger: Have you read the published Rationale? According to the Committee, the most fundamental aspects of the Spirit of C are "trust the programmer" and "Don't prevent the programmer from doing what needs to be done". They also explicitly recognize that one of C's strengths is the ability to use non-portable programs to do things that portable programs cannot (because the Standard does not provide for them). If some task cannot be done without performing some action, all implementations suitable for the task will support that action whether or not the Standard requires it.Turnbull
@toolforger: Compiler writers introduce a false dichotomy between speed and semantics. For a compiler to sometimes treat signed integer math as though performed on a wider type would enable 90%+ of the useful optimizations associated with jumping the rails on overflow. If such a compiler was given source code that exploits the fact that that's all it will do, it could achieve optimizations that would not be possible with source code written for an "overflow must be avoided at all costs" model.Turnbull
@toolforger: More generally, an optimization that assumes a programmer won't need to do X may be useful for programs that need to do X, but will be counter-productive in cases where the required behavior is precisely what would have been achieved by simply doing X. If action X is needed for some tasks but not others, and if the costs of supporting X on different implementations would vary, X should be supported on implementations or configurations used for tasks requiring it, but not on those where it would impose needless expense. That should be self-evident, but apparently isn't.Turnbull
@Turnbull the points you raise are about the language standard, not about compiler writers. Besides, the dichotomy isn't false - the ability to ignore undefined cases (instead of doing what you'd like them to do) can give speedups of up to 50%. It's really speed issues that have turned the C standard into something riddled with undefined behaviour, not "language vandals".Bandeau
BTW this is turning into the extended discussion of background details, which is not what comments are for.Bandeau
@toolforger: One brief parting question, then: do you believe the authors of the Standard intended to preclude the use of the language as a form of "high-level assembler"?Turnbull
@Turnbull "high-level assembler" is certainly high on the priority list, but there are certainly others. Since every decision in language design is a trade-off, there isn't even going to be a clear-cut "language X is geared towards feature A", ever; it's always a gradual thing.Bandeau
Let us continue this discussion in chat.Turnbull
T
2

Although the Standard grants implementations broad discretion to insert arbitrary amounts of space between structure members, that's because the authors didn't want to try to guess all the situations where padding might be useful, and the principle "don't waste space for no reason" was considered self-evident.

In practice, almost every commonplace implementation for commonplace hardware will use primitive objects whose size is a power of two, and whose required alignment is a power of two that is no larger than the size. Further, almost every such implementation will place each member of a struct at the first available multiple of its alignment that completely follows the previous member.
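For the question's struct, that placement rule can be checked with offsetof; the numbers assume a commonplace implementation where sizeof(int) == 4 and sizeof(double) == alignof(double) == 8:

```cpp
#include <cstddef>  // offsetof

struct X { int a; double b; int c; };

// a goes at offset 0; b goes at the first multiple of alignof(double)
// past a, i.e. 8; c follows at 16; the tail is padded to 24 so that
// b stays aligned in an X[] array.
static_assert(offsetof(X, b) == 8);
static_assert(offsetof(X, c) == 16);
static_assert(sizeof(X) == 24);
```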

Some pedants will squawk that code which exploits that behavior is "non-portable". To them I would reply

C code can be non-portable. Although it strove to give programmers the opportunity to write truly portable programs, the C89 Committee did not want to force programmers into writing portably, to preclude the use of C as a “high-level assembler”: the ability to write machine specific code is one of the strengths of C.

As a slight extension to that principle, the ability of code which need only run on 90% of machines to exploit features common to that 90% of machines--even though such code wouldn't exactly be "machine-specific"--is one of the strengths of C. The notion that C programmers shouldn't be expected to bend over backward to accommodate limitations of architectures which for decades have only been used in museums should be self-evident, but apparently isn't.

Turnbull answered 26/6, 2019 at 19:4 Comment(0)
J
1

You can use #pragma pack(1), but the very reason padding exists is that the compiler is optimizing for speed: accessing a variable through a full, naturally aligned register load is faster than piecing it together from misaligned bytes.

Specific packing is only useful for serialization and intercompiler compatibility, etc.

As NathanOliver correctly added, this might even fail on some platforms.
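For illustration, a sketch of what #pragma pack(1) actually does on gcc/clang/MSVC (sizes assume sizeof(double) == 8):

```cpp
#pragma pack(push, 1)
struct Packed {
    char   c;  // offset 0
    double d;  // offset 1: misaligned, so every access may be slower
};
#pragma pack(pop)

struct Unpacked {
    char   c;  // offset 0
    double d;  // offset 8, after 7 bytes of padding
};

static_assert(sizeof(Packed)   == 9);   // no padding at all
static_assert(sizeof(Unpacked) == 16);  // speed bought with 7 wasted bytes
```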

Jenness answered 25/6, 2019 at 20:33 Comment(2)
May want to note this carries potential performance issues or may cause the code to not work on some platforms: #7794011Disconcert
To my knowledge, using #pragma pack causes potential performance issues and as such is not the desired solution.Kaitlin
