TL;DR: start with no boxing, then profile.
Stack Allocation vs Boxed Allocation
This is perhaps more clear cut:
- Stick to the stack,
- Unless the value is big enough that it would blow it up.
While semantically writing `fn foo() -> Bar` implies moving `Bar` from the callee frame to the caller frame, in practice you are more likely to end up with the equivalent of a `fn foo(__result: *mut Bar)` signature, where the caller allocates space on its stack and passes a pointer to the callee.
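To make that lowering concrete, here is a minimal hand-written sketch of what it could look like. The `Bar` layout and the names `foo_lowered` and `call_lowered` are assumptions for illustration only; the actual calling convention is chosen by the compiler and is not something you write yourself.

```rust
use std::mem::MaybeUninit;

// Hypothetical type, large enough that it would not be returned in registers.
#[derive(Debug, PartialEq)]
struct Bar {
    data: [u64; 32],
}

// What the source code says: return `Bar` by value.
fn foo() -> Bar {
    Bar { data: [7; 32] }
}

// Roughly what the generated code may do instead: the caller passes
// a pointer to a "return slot" and the callee writes the result there.
unsafe fn foo_lowered(result: *mut Bar) {
    // SAFETY: the caller promises `result` points to writable,
    // properly aligned storage for one `Bar`.
    unsafe { result.write(Bar { data: [7; 32] }) }
}

// The caller side: reserve space on its own stack, hand out the pointer.
fn call_lowered() -> Bar {
    let mut slot = MaybeUninit::<Bar>::uninit();
    unsafe {
        foo_lowered(slot.as_mut_ptr());
        // SAFETY: `foo_lowered` fully initialized the slot.
        slot.assume_init()
    }
}
```

Calling `call_lowered()` yields the same value as `foo()`; the only difference is who owns the storage the value is first written into.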
This may not always be sufficient to avoid copying, as some patterns may prevent writing directly in the return slot:

    fn defeat_copy_elision() -> WithDrop {
        let one = side_effectful();
        if side_effectful_too() {
            one
        } else {
            side_effects_hurt()
        }
    }

Here, there is no magic:
- if the compiler uses the return slot for `one`, then in case the branch evaluates to `false` it has to move `one` out, then instantiate the new `WithDrop` into it, and finally destroy `one`,
- if the compiler instantiates `one` on the current stack, and it has to return it, then it has to perform a copy.
If the type didn't need `Drop`, there would be no issue.
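The reason `Drop` matters can be observed directly: in the `false` branch, the first value is still alive while the second one is created, so both cannot occupy the return slot at once. The sketch below is a self-contained stand-in for the function above; `WithDrop` is given a hypothetical definition here, and `pick`, `log`, `EVENTS` and `events_for` are names I made up for the demonstration.

```rust
use std::cell::RefCell;

thread_local! {
    // Records construction and destruction events in order.
    static EVENTS: RefCell<Vec<String>> = RefCell::new(Vec::new());
}

fn log(event: String) {
    EVENTS.with(|e| e.borrow_mut().push(event));
}

// Hypothetical stand-in for the `WithDrop` type discussed above.
struct WithDrop(u32);

impl WithDrop {
    fn new(id: u32) -> Self {
        log(format!("new {id}"));
        WithDrop(id)
    }
}

impl Drop for WithDrop {
    fn drop(&mut self) {
        log(format!("drop {}", self.0));
    }
}

// Same shape as `defeat_copy_elision`: one local, two possible results.
fn pick(take_first: bool) -> WithDrop {
    let one = WithDrop::new(1);
    if take_first {
        one
    } else {
        // `one` is still alive here; it is only dropped when the scope ends.
        WithDrop::new(2)
    }
}

// Runs `pick`, drops its result, and returns the recorded events.
fn events_for(take_first: bool) -> Vec<String> {
    EVENTS.with(|e| e.borrow_mut().clear());
    drop(pick(take_first));
    EVENTS.with(|e| e.borrow().clone())
}
```

With `take_first == false` the recorded order is `new 1`, `new 2`, `drop 1`, `drop 2`: the destructor of the first value runs after the second value exists, and since that order is observable the compiler has to preserve it.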
Despite these oddball cases, I advise sticking to the stack if possible unless profiling reveals a place where it'd be beneficial to box.
Inline Member or Boxed Member
This case is much more complicated.
The size of the `struct`/`enum` is affected, and thus CPU cache behavior is affected:
- less frequently used big variants are a good candidate for boxing (or boxing parts of them),
- less frequently accessed big members are a good candidate for boxing.
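The size effect is easy to check with `std::mem::size_of`. This is only a sketch: the `Inline` and `Boxed` enums and the 256-byte payload are made up for the example, and the exact layout of such types is unspecified, so treat the numbers as typical rather than guaranteed.

```rust
use std::mem::size_of;

// Every value of this enum is as large as its biggest variant.
enum Inline {
    Small(u8),
    Big([u8; 256]),
}

// Boxing the big variant shrinks every value to about pointer size
// plus a discriminant, at the cost of a heap allocation for `Big`.
enum Boxed {
    Small(u8),
    Big(Box<[u8; 256]>),
}

// Returns (size of the inline enum, size of the boxed enum) in bytes.
fn sizes() -> (usize, usize) {
    (size_of::<Inline>(), size_of::<Boxed>())
}
```

On a typical 64-bit target I would expect something like 257 bytes versus 16 bytes, which means far more `Boxed` values fit in a cache line when most of them are `Small`.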
At the same time, there are costs for boxing:
- it's incompatible with `Copy` types, and the containing type implicitly needs to run `Drop` (which, as seen above, disables some optimizations),
- allocating/freeing memory has unbounded latency¹,
- accessing boxed memory introduces a data dependency: you cannot know which cache line to request before knowing the address.
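The first cost can be checked at compile time and with `std::mem::needs_drop`. A minimal sketch, with `InlineMember`, `BoxedMember`, `is_copy` and `drop_glue` being hypothetical names introduced for the example:

```rust
use std::mem::needs_drop;

// An inline member keeps the struct trivially copyable.
#[derive(Clone, Copy)]
struct InlineMember {
    payload: [u64; 4],
}

// Adding `Copy` to this derive would fail to compile (error E0204),
// because `Box` is not `Copy`.
#[derive(Clone)]
struct BoxedMember {
    payload: Box<[u64; 4]>,
}

// Only accepts `Copy` types; a compile-time check in function form.
fn is_copy<T: Copy>(_: &T) -> bool {
    true
}

// Returns whether dropping each type requires running any code.
fn drop_glue() -> (bool, bool) {
    (needs_drop::<InlineMember>(), needs_drop::<BoxedMember>())
}
```

`is_copy` accepts an `InlineMember` but rejects a `BoxedMember` at compile time, and `drop_glue()` reports that only the boxed version needs code to run when it is destroyed.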
As a result, this is a very fine balancing act. Boxing or unboxing a member may improve the performance of some parts of the codebase while decreasing the performance of others.
There is definitely no one-size-fits-all answer.
Thus, once again, I advise avoiding boxing until profiling reveals a place where it'd be beneficial to box.
¹ Consider that on Linux, any memory allocation for which there is no spare memory in the process may require a system call, which, if there is no spare memory in the OS, may trigger the OOM killer to kill a process, at which point its memory is salvaged and made available. A simple `malloc(1)` may easily require milliseconds.