Why is sizeof(std::mutex) == 40 when the cache line size is often 64 bytes?

The following static_assert passes on both gcc and clang trunk:

#include <mutex>

int main() {
    static_assert(sizeof(std::mutex) == 40);
}

Since x86 CPUs have a 64-byte cache line, I was expecting sizeof(std::mutex) to be 64 so that false sharing can be avoided. Is there a reason why the size is "only" 40 bytes?
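For example, a minimal sketch of the scenario I am worried about (adjacency of the two globals is typical but not guaranteed, and the 64-byte line size is an assumption):

#include <cstdint>
#include <iostream>
#include <mutex>

std::mutex m1, m2; // 40 bytes each here, so both can fit in one 64-byte line

int main() {
    // Map an address to its 64-byte cache-line number.
    auto line = [](const void* p) { return reinterpret_cast<std::uintptr_t>(p) / 64; };
    // If both mutexes map to the same line, contention on m1 also
    // bounces the line holding m2 (false sharing).
    std::cout << (line(&m1) == line(&m2) ? "same cache line\n" : "different cache lines\n");
}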

Note: I know that size also costs performance, but a program rarely has a huge number of mutexes, so the size overhead seems negligible compared to the cost of false sharing.

Note: there is a similar question asking why std::mutex is so large; I am asking why it is so small. :)

Edit: on MSVC 16.7, the sizeof is 80.

Secern answered 2/10, 2020 at 11:7 Comment(4)
If we assume std::mutex is little more than a glorified structure, and that the default minimum alignment of a structure depends on its largest field rather than the size of the structure as a whole, then it's reasonable to assume that sizeof(std::mutex) has almost nothing to do with minimal alignment at all, and is even less indicative of optimal alignment. Instead, if you want 64-byte alignment you want 64-byte alignment regardless of structure size (e.g. using something like alignas(64)); sizeof() is mostly irrelevant, and std::alignment_of() should be used instead.Malacology
What is "false shared" here? How does the false-sharing concept apply to mutexes, which ARE shared?Conferral
@DanM.: Other data in the same cache line, including possibly the shared data protected by the mutex. If other threads are hammering on the mutex trying to take the lock, the cache line containing it will tend to flip to shared state, or even be invalidated from the owner's L1d cache. The question is proposing that alignof(mutex) = sizeof(mutex) = std::hardware_destructive_interference_size or something to make sure a mutex has a cache line to itself. (note that hw_destructive_... should be 128 on some modern x86-64, if you're going to fix a size for it, because of adjacent-line HW prefetch)Universe
@DanM. what Peter Cordes said is what I am asking (although he explained it much better than I could)Secern

Forcing padding where it's not needed would be bad design. Users can always pad if they have nothing useful to put in the rest of the cache line.

You probably want the mutex in the same cache line as the data it's protecting if it's usually lightly contended; only one cache line has to bounce around, instead of a second cache miss when accessing the shared data after acquiring the lock. This is probably common with fine-grained locking, where many objects each have their own std::mutex, and it makes keeping the mutex small more beneficial.

(Heavy contention could create false sharing between readers trying to acquire the lock and the lock owner writing to the shared data after gaining ownership of the lock. Flipping the cache line to "shared", or invalidating it, before the lock owner has a chance to write would indeed slow things down.)
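A minimal sketch of that fine-grained, lightly-contended case (the Counter type and its fields are hypothetical; the 40-byte mutex size is the libstdc++/x86-64 figure from the question):

#include <mutex>

// The mutex and the data it guards share one cache line, so taking
// the lock also tends to pull the data into cache.
struct Counter {
    std::mutex m;   // 40 bytes here
    long value = 0; // typically lands in the same 64-byte line as m

    void increment() {
        std::lock_guard<std::mutex> lock(m);
        ++value; // no second cache miss after acquiring the lock
    }
};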


Or the space in the rest of the line could be some very-rarely-used thing that needs to exist somewhere in the program, but maybe only used for error handling so its performance doesn't matter. If it couldn't share a line with a mutex, it would have to be taking up space somewhere else. (Maybe in some page of "cold" data, so this isn't a great example).

It's probably unlikely that you'd want to malloc or new a mutex itself, although one could be part of a class you dynamically allocate. Allocator overhead is a real thing, e.g. using 16 bytes of memory before the allocation for bookkeeping space. (Large allocations with glibc's malloc/new are often page-aligned + 16 bytes, making them misaligned wrt. all wider boundaries). Dynamic-allocator bookkeeping is a very good thing for a mutex to be sharing space with: it's probably not read or written by anything while the mutex is in use.


Non-lock-free std::atomic objects typically use an array of locks (maybe just simple spinlocks, but they could be std::mutex). If the latter, you don't expect two adjacent mutexes to be used simultaneously, so it's good to pack them all together.
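As an illustration (a hypothetical sketch, not the actual code of any runtime library), such a lock table might hash the atomic object's address into a densely packed array:

#include <cstdint>
#include <mutex>

// 64 densely packed mutexes; adjacent entries are unlikely to be
// contended at the same time, so padding them would just waste cache.
static std::mutex lock_table[64];

std::mutex& lock_for(const void* obj) {
    auto bits = reinterpret_cast<std::uintptr_t>(obj);
    return lock_table[(bits >> 4) % 64]; // drop low bits, pick a lock
}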


Also, increasing its size would be a very clunky way to try to ensure no false sharing. An implementation that wanted to make sure a std::mutex had a cache line to itself would want to declare it with alignas(64) to make sure its alignof() was 64. That would force padding to make sizeof(mutex) a multiple of alignof (in this case equal to it).

But note that std::hardware_destructive_interference_size should be 128 on some modern x86-64, if you're going to fix a size for it, because of adjacent-line hardware prefetch in Intel's L2 caches. That's a weaker destructive effect than sharing the same cache line, and 128 bytes is too much space to waste per mutex.
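To illustrate the mechanics, a sketch of such a wrapper (the PaddedMutex name is made up, and the asserts assume a 40-byte mutex as in the question):

#include <mutex>

// alignas(64) raises alignof to 64, and sizeof is padded up to a
// multiple of alignof, so the 40-byte mutex occupies a full line.
struct alignas(64) PaddedMutex {
    std::mutex m;
};
static_assert(alignof(PaddedMutex) == 64);
static_assert(sizeof(PaddedMutex) == 64);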

Universe answered 2/10, 2020 at 11:33 Comment(9)
Accepted the answer since it provides many good points, but I am obviously not certain this is the correct choice by the STL implementers for the "average" program; then again, it is impossible for me or anybody to profile every C++ program on the planet. :)Secern
@NoSenseEtAl: There's no good way to make it hand-hold beginners without making it almost literally impossible to optimize it if you do know what you're doing. I think you're still ignoring the fine-grained locking case where you have a std::mutex inside each instance of an object, so could have quite a few allocated. More importantly, any given std::mutex is usually not heavily contended in most programs, so it's not worth wasting valuable space in a hot cache line on padding.Universe
As my edit to the question says, mutex on MSVC is 80 bytes... did they intentionally pad or not? I could not find out; they just use some magic number _Mtx_internal_imp_size that is 80...Secern
@NoSenseEtAl: No idea, I don't have an MSVC install or a Windows dev environment. I'd assume there isn't intentional padding, but IDK what they'd be using all that space for. Presumably including some handles for kernel resources, but IDK what else.Universe
The MSVC STL is on GitHub, but without VS you cannot click through the code, so it is hard to navigate... I am not saying you should do it, or that 80 proves I am right, but MSVC obviously picked a size over 64; it is possible that they did it intentionally due to false sharing... but there is no way to know; it could just be some large state they need to keep for the mutex.Secern
@NoSenseEtAl: I'm somewhat curious whether MSVC included intentional padding or not, but as you say it doesn't change my answer to the question. (Or my opinion that glibc's 40 bytes is not "too small"). Besides, 80 bytes with only an 8-byte alignment requirement (I assume) is not particularly good for avoiding false sharing; depending on where the part is that other threads will access, it can still be in the same cache line as up to 56 bytes of stuff to create false sharing. (If users don't put the mutex as the first object in their class).Universe
I was mostly concerned with scenarios like mutex m1; mutex m2; (where they are adjacent in memory), or arrays of mutexes.Secern
@NoSenseEtAl: I guess you now realize that there are many other use-cases for mutexes, so wasting space for those other use-cases would be a tradeoff against improving performance for cases like you describe (and then only when such code doesn't take precautions to avoid false sharing of the mutexes). I didn't think it was all that common to have a bunch of mutexes declared together, and I think arrays of mutexes are probably even rarer.Universe
I just do not know; like I said in my first comment, "it is impossible for me or anybody to profile every C++ program on the planet". But again, I do not think we will ever get a better answer than yours unless somebody from Dinkumware/MSFT comes here and tells us why they did what they did. :)Secern

Maybe your solution is to use alignas? Something like:

alignas(std::hardware_destructive_interference_size) std::mutex mut;

Now your mutex starts on a hardware destructive-interference boundary.
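Since some standard libraries do not define that constant (see the comment below), a guarded fallback might look like this (kCacheLine is a made-up name, and the 64-byte fallback is an assumption):

#include <cstddef>
#include <mutex>
#include <new> // std::hardware_destructive_interference_size, when provided

#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kCacheLine = 64; // assumed; 128 may be safer on some x86-64
#endif

alignas(kCacheLine) std::mutex mut;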

Inhume answered 2/10, 2020 at 13:44 Comment(1)
1) I know this; my question was why STL implementations do not guard against this, since I assume most people do not know it. 2) hardware_destructive_interference_size remains unimplemented on gcc/clang, and it may stay like that forever #62026086Secern
