The object representation of a null pointer is not guaranteed to be the bit-pattern 0, so even in C it's not fully portable to use memset instead of assigning ptr = 0;. But all modern mainstream systems do use an all-zero bit-pattern for null pointers. In the rest of this answer, I'll assume you only care about optimizing for such platforms.
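To make the distinction concrete, here's a minimal sketch (struct and function names are made up for illustration) of the portable assignment versus the memset shortcut that assumes all-bits-zero is a null pointer:

#include <cstring>

struct Node { Node *next; };

void reset_portable(Node &n) {
    n.next = nullptr;               // portable: uses whatever the null representation is
}

void reset_assuming_zero_null(Node &n) {
    std::memset(&n, 0, sizeof(n));  // assumes all-bits-zero is a null pointer (true on mainstream platforms)
}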
volatile int is also not guaranteed by the standards to be thread-safe in C or C++ (although it does work on most compilers as a legacy way to get something like atomic with memory_order_relaxed). memset on a struct containing volatile members doesn't respect their volatility, so your proposed C equivalent was potentially not even safe in practice, let alone guaranteed by anything. The C equivalent of std::atomic<int> / std::atomic_int is _Atomic int / atomic_int. C11 doesn't have a portable equivalent of atomic_ref, only compiler-specific stuff like GNU C __atomic_load_n etc.
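For illustration, a minimal C++ sketch (variable names are made up) of the relaxed-atomic replacement for what people often misuse volatile for, i.e. an access that's atomic but imposes no ordering on surrounding memory operations:

#include <atomic>

std::atomic<int> counter{0};   // instead of: volatile int counter = 0;

void bump() {
    // Relaxed ordering: atomic, but no ordering guarantees relative to other operations.
    counter.fetch_add(1, std::memory_order_relaxed);
}

int peek() {
    return counter.load(std::memory_order_relaxed);
}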
To allow efficient zeroing during single-threaded phases of your program, consider using C++20 std::atomic_ref on a plain int member of your struct instead of a std::atomic<int> member. This leaves your struct with only primitive types as members, making it safe to memset (or std::fill) as long as no other threads are reading or writing simultaneously.
struct Foo {
    alignas(std::atomic_ref<int>::required_alignment) int value;
    Foo *other;
};
It would still be data-race UB to memset a struct while any other threads could be reading or writing it, so this only helps if you have phases of your program where synchronization guarantees that no other threads will be accessing your array (unless you want to rely on non-portable behaviour, discussed later).
One thing that would make it faster is not padding your structs to 4x their size. Your struct would be 16 bytes (on a typical 64-bit architecture) without alignas(64), or 8 bytes total with 32-bit pointers. It can make sense to align the whole array of structs by 64, but aligning every individual struct puts each one in a separate cache line. Perhaps you're doing that to avoid false sharing? That does make them slower to zero, since there are more cache lines to write, so test how much (if at all) putting each pointer+counter pair in its own cache line actually speeds up your program on various hardware.
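A quick sanity check of that size trade-off, as a sketch assuming a typical 64-bit ABI (4-byte int, 8-byte pointers); the struct names here are made up just for the comparison:

#include <atomic>

struct FooPacked {                  // no per-struct alignment
    alignas(std::atomic_ref<int>::required_alignment) int value;
    FooPacked *other;
};
struct alignas(64) FooPadded {      // one struct per cache line
    alignas(std::atomic_ref<int>::required_alignment) int value;
    FooPadded *other;
};

// These hold on a typical 64-bit ABI; they'd fail on e.g. a 32-bit target.
static_assert(sizeof(FooPacked) == 16, "4-byte int + padding + 8-byte pointer");
static_assert(sizeof(FooPadded) == 64, "padded out to a full cache line");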
With 3/4 of the space being padding (assuming 16-byte Foo objects), memset on the whole array would typically be doing at least 2 stores per struct (x86-64 with AVX2 for 32-byte stores). Worse on AArch64 without SVE (16-byte vectors), much worse on RISC-V32 without vector extensions (just 4-byte scalar stores, AFAIK).
So if you are going to use this much padding, it's not bad to just loop manually and do normal assignments to the two members of each struct. If this has to be thread-safe (so you can't just access a plain int), use memory_order_relaxed for the atomic member, unless you need something stronger. You certainly don't want to be draining the store buffer for every struct, which is what would happen on x86 with arr[i].value = 0; (which defaults to seq_cst), so use arr[i].value.store(0, std::memory_order_relaxed).
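A minimal sketch of that manual loop for the std::atomic<int>-member version of the struct (the FooAtomic name, array name and size are made up for the example):

#include <atomic>

struct FooAtomic {
    std::atomic<int> value;
    FooAtomic *other;
};

FooAtomic arr[1024];

void zero_all_relaxed() {
    for (auto &s : arr) {
        s.value.store(0, std::memory_order_relaxed);  // avoid the implicit seq_cst of s.value = 0
        s.other = nullptr;
    }
}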
Looping manually would result in two stores per struct when you really only need one 16-byte store. Compilers don't optimize atomics, and they won't turn .value.store(0, relaxed); .other = nullptr; into a single 16-byte store, unfortunately. Even without an atomic member, GCC has a missed-optimization (bug 82142) where it avoids writing padding, stopping it from coalescing stores when doing struct assignment. Using size_t value or ptrdiff_t value could avoid that.
With no atomic members, store coalescing should result in a loop that does one 16-byte store (or 8-byte in 32-bit code) per 64-byte struct. Unfortunately GCC fails at that for x86-64, but clang succeeds. GCC gets it right for AArch64.
#include <atomic>
#include <stddef.h>

struct alignas(64) Foo_atomic_ref {
    // pointer-sized to avoid padding between members, which trips up GCC's store coalescing
    alignas(std::atomic_ref<ptrdiff_t>::required_alignment) ptrdiff_t value;
    Foo_atomic_ref *other;
};

Foo_atomic_ref shared_array_aref[1024];

// if no other threads could be accessing the array
void zero_array_nonatomic(){
    for (auto &s : shared_array_aref){
        s.value = 0;        // GCC for x86-64 fails to use movdqa; AArch64 uses stp though.
        s.other = nullptr;
    }
}

// if some other threads could be accessing the array
void zero_array_atomic_ref(){
    for (auto &s : shared_array_aref){
        std::atomic_ref ref(s.value);   // free to construct
        ref.store(0, std::memory_order_relaxed);
        s.other = nullptr;              // separate stores from both compilers
    }
}
Godbolt - GCC for x86-64 fails to coalesce the two 8-byte stores even in the non-atomic version where that would be possible. But GCC for AArch64 does use scalar stp xzr, xzr, [x0] for it, which is a 16-byte store of the zero-register twice. Most microarchitectures run it as a single 16-byte store, at least on ARMv8.4 and later where it's guaranteed atomic, so it's efficient.
Clang compiles these to asm at least as good as memset would produce, with no branching to handle tiny sizes, just an unrolled loop. The non-atomic version does only one movaps xmmword ptr [rax], xmm0 per struct; the atomic_ref loop does separate stores for each member. Neither spends instructions storing the padding the way memset on the whole array would.
# clang -O3
zero_array_atomic_ref():
        lea     rax, [rip + shared_array_aref]
        lea     rcx, [rip + shared_array_aref+65536]
.LBB1_1:
        mov     qword ptr [rax], 0
        mov     qword ptr [rax + 8], 0        # first struct
        mov     qword ptr [rax + 64], 0
        mov     qword ptr [rax + 72], 0       # second struct
        ....
        mov     qword ptr [rax + 448], 0
        mov     qword ptr [rax + 456], 0
        add     rax, 512
        cmp     rax, rcx
        jne     .LBB1_1
        ret
On real hardware, the atomic_ref version would also be safe with 16-byte movaps stores, but hardware vendors don't guarantee it. See Per-element atomicity of vector load/store and gather/scatter? - it's not plausible that a 16-byte aligned store could tear at 8-byte boundaries on x86, especially since 8-byte atomicity is guaranteed.
On x86 with AVX, 16-byte aligned stores are guaranteed atomic, so it would actually be fully safe for GCC and clang to coalesce the atomic 8-byte store to .value with the non-atomic 8-byte store to .other. But compilers aren't that smart, treating atomic a lot like volatile.
It's frustrating to know that most (all?) hardware can do stuff C++ won't let you do portably, but such is life in writing portable programs.
You could manually vectorize with SIMD intrinsics like _mm_store_si128. Or modern compilers will inline a 16-byte memset as a single instruction, though there's no guarantee of that. Or use offsetof(Foo, other) + sizeof(Foo::other) as the size for each memset, to only write the parts of the struct that contain non-padding data. offsetof isn't guaranteed to work on structs that aren't "standard layout", but C++17 makes that "conditionally supported" instead of UB. Of course, memset on objects being read and written by other threads is data-race UB, so I don't recommend it unless you're willing to carefully check the generated asm for every compiler version, and for every place this zeroing gets inlined into, to make sure it always compiles to safe asm.
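A sketch of those two options, assuming x86 with SSE2 and the Foo_atomic_ref / shared_array_aref definitions from the Godbolt example above are in scope; as just discussed, doing this while other threads access the array relies on hardware behaviour rather than any language guarantee:

#include <immintrin.h>
#include <cstddef>
#include <cstring>

void zero_array_simd() {
    for (auto &s : shared_array_aref) {
        // One aligned 16-byte store covers both .value and .other;
        // the struct is alignas(64), so &s is 16-byte aligned.
        _mm_store_si128(reinterpret_cast<__m128i*>(&s), _mm_setzero_si128());
    }
}

void zero_array_small_memsets() {
    for (auto &s : shared_array_aref) {
        // Only write the non-padding bytes; compilers typically inline this tiny memset.
        std::memset(&s, 0, offsetof(Foo_atomic_ref, other) + sizeof(s.other));
    }
}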
(std::fill wouldn't be usable this way since it takes an iterator range of the same type. You can't pass it a ptrdiff_t * as the start and a Foo ** as the end iterator. You could pass it whole Foo * iterators, but then it might do a full 64-byte struct assignment, unless the compiler decides to skip storing the padding. If you care a lot about what asm you're getting, it's probably not a good choice.)
With std::atomic<int>, not plain int + atomic_ref
In this case, std::fill won't compile because std::atomic is not trivially copyable (deleted copy constructor). In C++, struct assignment goes per-member. memset bypasses that and lets you do things you're not "supposed" to do.
If no other threads are running, memset on a std::atomic<int> (or a struct containing one) works in practice on mainstream C++ implementations because std::atomic<int> is lock-free and its object representation is the same as an int. But I wouldn't recommend writing code this way, since ISO C++ doesn't guarantee it.
On GCC/Clang with libstdc++ or libc++, std::atomic<int> has a plain int member, and the member functions are wrappers for compiler built-ins like GNU C __atomic_load_n and __atomic_fetch_add. So there'd be no UB in using memset to change the bytes of the object representation. But again, ISO C++ doesn't guarantee anything about the internals of std::atomic, so this would be relying on implementation details.
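If you do rely on that, it's worth pinning the assumptions down with compile-time checks. A sketch, reusing the hypothetical FooAtomic struct and arr array from the earlier loop example; the asserts only document the assumptions, they don't make the memset portable:

#include <atomic>
#include <cstring>

// Assumptions this shortcut relies on; if either fails, don't memset.
static_assert(std::atomic<int>::is_always_lock_free,
              "expect a lock-free atomic<int> with no embedded lock");
static_assert(sizeof(std::atomic<int>) == sizeof(int),
              "expect the object representation to be just a plain int");

void zero_all_single_threaded() {
    // Only while no other threads can access arr; still not guaranteed by ISO C++.
    std::memset(arr, 0, sizeof(arr));
}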
For larger non-lock-free atomic objects like atomic<big_struct>, some compilers (like MSVC: Godbolt) include a spinlock inside the std::atomic object. IDK if any compilers include a full std::mutex, which shouldn't be zeroed even if you know it's unlocked (no concurrent readers or writers). (Most other compilers use a separate hash table of spinlocks, and people have said MSVC is planning that change for their next C++ ABI break.)
MSVC uses zero as the unlocked state, so static zero-initialized storage is usable directly; but in theory an implementation could use 0 to mean locked, in which case memset would create a deadlock.
Zeroing a spinlock while there are concurrent readers or writers could let some threads see torn values and maybe cause some memory_order weirdness.
It's still data-race UB to memset while other threads might be reading or writing the object. In practice it will probably just work, since any decent memset works internally as if storing in chunks at least as wide as int, especially on a large aligned array, and in asm a relaxed atomic store doesn't need any special instructions on mainstream platforms.
But without actual guarantees, there's a lot of "this probably won't break" being relied on, so I wouldn't recommend it for portable code. Only consider memset while other threads are running if you need to squeeze every drop of performance out of something for a specific compiler / standard library and architecture, and you're prepared to verify your assumptions by checking the asm of the standard-library implementation, or checking what gets inlined into your code.
And as I said earlier, padding your structs to one per cache line means only 1/4 or 1/8 of the bytes actually need to get stored, so an optimized libc memset isn't such an advantage. If you only aligned your structs by their size, memset could be a lot better, since it can use 32-byte stores even if the rest of your program isn't compiled to depend on x86 AVX or ARM SVE or whatever.
On some microarchitectures, storing every byte of a cache line can let the CPU just invalidate other copies of the line without doing a Read For Ownership. Normally a store has to merge some bytes into the old value of a cache line, but storing every byte of it can avoid that. A CPU would have to detect that across multiple stores, though, not just by looking at the first store to a line, unless you're using AVX-512 to store a whole cache line at once. (And even then IDK if x86 CPUs do anything special for an aligned vmovdqa64 [mem], zmm, but they might.) See also Enhanced REP MOVSB for memcpy for more about no-RFO store protocols on x86 CPUs. That's normally most relevant for larger arrays that don't fit in cache, but could be relevant here. NT stores like movntps / _mm_stream_ps also avoid RFOs, but they fully evict the line from cache, including from L3, which makes the next read slower. (Shared L3 cache is normally a backstop for coherency traffic.)