Can anyone point to benchmark results comparing the performance of C11/C++11 code using relaxed atomic operations (particularly memory_order_release and memory_order_acquire, but also memory_order_consume and memory_order_relaxed) versus the default memory_order_seq_cst? All architectures are of interest. Thanks in advance.
I did a bit of benchmarking on ARMv7, see https://github.com/reinhrst/ARMBarriers for the report, the slides for my talk at EuroLLVM, and the seqlock code I used.
Short story: in the seqlock code, the Acquire/Release function was about 40% faster than the sequentially consistent version.
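For readers unfamiliar with the pattern, here is a minimal seqlock sketch along the lines of what was benchmarked. It is my own illustration, not the code from the linked repo, and it assumes a single writer:

```cpp
#include <atomic>
#include <cstdint>

struct SeqLock {
    std::atomic<uint32_t> seq{0};
    std::atomic<uint32_t> data1{0}, data2{0};

    // Single writer assumed: bump seq to odd, publish data, bump to even.
    void write(uint32_t a, uint32_t b) {
        uint32_t s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);   // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        data1.store(a, std::memory_order_relaxed);
        data2.store(b, std::memory_order_relaxed);
        seq.store(s + 2, std::memory_order_release);   // even: stable again
    }

    // Returns false if a write raced with the read; caller retries.
    bool try_read(uint32_t& a, uint32_t& b) {
        uint32_t s0 = seq.load(std::memory_order_acquire);
        if (s0 & 1) return false;                      // writer active
        a = data1.load(std::memory_order_relaxed);
        b = data2.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        return seq.load(std::memory_order_relaxed) == s0;
    }
};
```

The sequentially consistent variant would simply use memory_order_seq_cst (the default) everywhere; on ARMv7 that emits extra dmb barriers, which is where the ~40% difference comes from.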
This might not be the best answer, but so far I have been using CDSChecker for some benchmarking in one of my projects. I have not yet used it on complete programs, only on independent units.
For a particular chunk of code (a work-stealing deque), I found a very nice paper that benchmarks a C11 version using weak atomics against a version using only seq_cst atomics, hand-optimized assembly, and an incorrect version using fully relaxed atomics. (By coincidence, a bug was later found in the C11 version by the aforementioned CDSChecker.) Similar examples are welcome.
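The "fully relaxed" pitfall that the paper's incorrect variant runs into can be sketched with the classic message-passing pattern (my own illustration, not code from the paper):

```cpp
#include <atomic>

std::atomic<int> data{0};
std::atomic<bool> ready{false};

// Producer: the release store guarantees that the write to 'data'
// is visible to any thread that sees ready == true via an acquire load.
void produce() {
    data.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);  // relaxed here would be
                                                   // a bug: the reader could
                                                   // see ready == true but
                                                   // data == 0
}

// Consumer: the acquire load pairs with the release store above.
int consume() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    return data.load(std::memory_order_relaxed);   // guaranteed to see 42
}
```

With both operations on ready relaxed, nothing orders the data write before the flag write, so the consumer can legally observe the stale value; this is exactly the class of bug weak-atomics code has to argue away, and why tools like CDSChecker earn their keep.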
The question as such doesn't make sense, and it's important to understand why.
An atomic operation is just a simple operation on a scalar object, except that it can be used for inter-thread communication; the memory ordering only affects what is guaranteed about other memory locations.
[Note: the standard text doesn't formally guarantee that, but it's meant to, and a consistent approach to C/C++ thread semantics must be based on it.]
You can compare the speed of a multiplication and a cosine, but you can't compare the cost of outputting "hello world" with the cost of flushing cout. A flush on a stream or file doesn't have an intrinsic price tag: its cost depends on its relation to other operations.
You can't compare the speed of an operation that blocks until some previous operation is complete with one that doesn't.
Also, you can't benchmark in a vacuum. You need some workload, a pattern of operations.
You would need to learn a lot about modern CPU design, and by modern I mean anything invented in the last two decades. To even dream of designing a useful benchmark in the abstract, you need at least some idea of the complexity of a real CPU and the way it runs code, of how multiple cores interact, and of the general principles of memory caches.
Or, you could just write your program, and profile it to see if there is really an issue with atomic operations.
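If you do go down the measuring route, a toy harness like the following (entirely my own illustration) at least shows where a difference could appear, though, per the argument above, such numbers mean little outside a real workload:

```cpp
#include <atomic>
#include <chrono>

std::atomic<int> flag{0};

// Times 'iters' stores with the given ordering. On x86 a release store
// is a plain mov, while a seq_cst store typically compiles to xchg
// (or mov + mfence), so the two instantiations can differ noticeably;
// in a real program the difference also depends on what the compiler
// may reorder around the stores.
template <std::memory_order Order>
long long time_stores_ns(int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        flag.store(i, Order);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0)
        .count();
}
```

Profiling the actual program, as suggested above, tells you whether any of this matters at all for your workload.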
© 2022 - 2024 — McMap. All rights reserved.
I replaced a seq_cst load in find with an acquire load and got a 65% performance increase. This perf presumably came not at the instruction level (since any reasonable compiler for x86 will compile the load in the same way), but because the compiler was able to reorder things around it. This is a great example of why you need real benchmarks and not just cycle counts. – Minority