C11/C++11 weak memory benchmarks
Can anyone point to benchmark results comparing the performance of C11/C++11 code using weaker atomic orderings (particularly memory_order_release and memory_order_acquire, but also memory_order_consume and memory_order_relaxed) against the default memory_order_seq_cst? All architectures are of interest. Thanks in advance.

Minority asked 3/11, 2013 at 11:36 Comment(11)
I daresay that performance does not matter so much as correctness does. You should use memory ordering to make the semantics of your loads and stores correct, not use something different because it is maybe 3 cycles faster. OTOH, if you have no clue, stay with the default, which may be slightly slower (often it isn't even that), but is guaranteed to be correct. There's no worse nightmare than "optimized" multithreaded code that's incorrect.Ovi
The default sequentially consistent memory ordering is stricter than any of the others, so performance is the only reason to use weaker orderings. I know how to reason about correctness in either case, but the reasoning is more difficult for weaker orderings. In any event, it's not a question of 3 cycles; even on an i7, flushing the store buffer is typically tens of cycles (perhaps much more on older processors), and the difference between seq_cst and consume on ARM is much greater still.Minority
It is hardly possible to get meaningful results from a benchmark. A single load/store is hardly measurable in a reliable way. Loading and storing the same value (or a set of values) many times without synchronization will only benchmark cache thrashing, but not the actual operation. Synchronizing will measure the synchronization primitive, which is ~1000X more expensive. That, and you really have no choice but to write correct code. If you need happens-before guarantee on some data, you cannot leave it out just because it would be faster.Ovi
@Ovi there's nothing wrong with wanting to understand the performance of different approaches. That does not imply that you plan to sacrifice correctness.Thurnau
@jalf: That is of course correct. Which still leaves the problem that it's pretty much impossible to measure this. How do you e.g. measure the impact of instruction reordering? Benchmarking such is simply not meaningful for a real program (only way I could imagine would be to use an existing real program, such as a database, and recompile this with stricter semantics, and stress-test / benchmark the application, and even then it's more than questionable whether this translates to a different application, since it may have entirely different access patterns).Ovi
@Ovi is right that just banging on an atomic with multiple threads would just thrash the cache. But there are examples that don't thrash the cache. For example, I seem to recall that rewriting spinlock release to use a simple write on x86 (which is a release) instead of an interlocked store resulted in a measurable performance increase for Linux. Another example that I would expect to be measurable would be traversing lock-free linked lists using consume on ARM.Minority
In this Facebook engineering talk about the implementation of an atomic hashmap in their open-source library folly, they claim that they gained a performance boost of ~50% by using relaxed semantics. They are also doing some other hacking, but it's an interesting talk and worth watching. facebook.com/…Outspan
They're however using compare_exchange_strong as well as release in the critical part, not relaxed. Which is why it works. They're not just using relaxed.Ovi
Checking the code, they are using both. github.com/facebook/folly/blob/master/folly/AtomicHashMap-inl.hOutspan
The relevant slide is at the end of the talk, where they replaced a seq_cst load in find with an acquire load and got a 65% performance increase. This improvement presumably came not at the instruction level (since any reasonable compiler for x86 will compile both loads the same way), but because the compiler was able to reorder things around it. This is a great example of why you need real benchmarks and not just cycle counts.Minority
@Ovi sure, meaningfully benchmarking something like this is very very tricky. I'm just pointing out that asking about, and being interested in understanding, the performance aspects as well is certainly not irrelevant.Thurnau
I did a bit of benchmarking on ARMv7, see https://github.com/reinhrst/ARMBarriers for the report, the slides for my talk at EuroLLVM, and the seqlock code I used.

Short story: in the seqlock code, the acquire/release version was about 40% faster than the sequentially consistent version.

Caper answered 13/4, 2014 at 7:27 Comment(2)
The article is very nice. On the other hand, it seems to argue that the gap between sc and acq-rel is (at least in this example) purely due to lazy compilation, since with your proposed compiler passes in place, the gap above is essentially eliminated, right?Minority
It's been a while since I dove into this, but as far as I recall the gap between sc and acq-rel is (in this example) due to missing compiler optimisations (before the patch mentioned in the GitHub repo was added to llvm), because in this example, in the ARM memory model, acq-rel gives you the same guarantees as sc (again: in the provided example, in the ARM memory model, pre-llvm-patch). In no way I'm trying to argue that sc and acq-rel should be equally fast in the general case.Caper
This might not be the best solution, but so far I have been using CDSChecker for some benchmarking in one of my projects. I have not yet used it on complete programs, only on independent units.

Rescissory answered 28/3, 2014 at 11:0 Comment(1)
CDSChecker isn't designed for benchmarking, just for exhaustive testing for correctness. CDSChecker provides no information about what the performance of your code will be.Dyne
For a particular chunk of code (a work-stealing dequeue), I found a very nice paper that benchmarks a C11 version with weak atomics, with sc-atomics only, hand-optimized assembly, and an incorrect version using fully relaxed atomics. (By coincidence, a bug was later found in the C11 version by the aforementioned CDSChecker.) Similar examples are welcome.

Minority answered 28/6, 2015 at 14:24 Comment(0)
The question as such doesn't make sense, and it's important to understand why.

An atomic operation is just a simple operation on a scalar object, except that it can be used for inter-thread communication; the ordering only affects what is guaranteed about other memory locations.

[Note: the standard text doesn't formally guarantee that, but it's meant to guarantee it, and a consistent approach to C/C++ thread semantics must be based on it.]

You can compare the speed of a multiplication and a cosine, but not compare the cost of outputting "hello world" and flushing cout. The flush on a stream or file doesn't have an intrinsic price tag: it relates to other operations.

You can't compare the speed of an operation that blocks until some previous operation is complete with one that doesn't.

Also, you can't benchmark in a vacuum. You need some workload, a pattern of operations.

You would need to learn a lot about modern CPU design, and by modern I mean anything invented in the last two decades. You need at least some idea of the complexity of a real CPU and the way it runs code, of how multiple cores interact, and of the general principles of memory caches, to even dream of designing a useful benchmark in the abstract.

Or, you could just write your program, and profile it to see if there is really an issue with atomic operations.

Bray answered 15/12, 2019 at 8:29 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.