Is lockless hashing without std::atomics guaranteed to be thread-safe in C++11?
Consider the following attempt at a lockless hashtable for multithreaded search algorithms (inspired by this paper)

struct Data
{
    uint64_t key;
    uint64_t value;
};

struct HashEntry
{
    uint64_t key_xor_value;
    uint64_t value;
};

void insert_data(Data const& e, HashEntry* h, std::size_t tableOffset)
{
    h[tableOffset].key_xor_value = e.key ^ e.value;
    h[tableOffset].value = e.value;
}

bool data_is_present(Data const& e, HashEntry const* h, std::size_t tableOffset)
{
    auto const tmp_key_xor_value = h[tableOffset].key_xor_value;
    auto const tmp_value = h[tableOffset].value;

    return e.key == (tmp_key_xor_value ^ tmp_value);
}

The idea is that a HashEntry struct stores the two 64-bit words of a Data struct, with the key XOR-ed against the value. If reads and writes to the two 64-bit words of a HashEntry interleave across threads, the reading thread can detect this by XOR-ing the stored words together again and comparing the result against the original key. Corrupted hash entries may cost some efficiency (a wasted probe), but correctness is supposedly guaranteed whenever the decoded key matches the original.

The paper mentions that it is based on the following assumption:

For the remainder of this discussion, assume that 64 bit memory read/write operations are atomic, that is the entire 64 bit value is read/written in one cycle.

My questions are: is the above code without explicit use of std::atomic<uint64_t> guaranteed to be thread-safe in C++11? Or can the individual 64-bit words be corrupted by simultaneous reads/writes? Even on 64-bit platforms? And how is this different from the old C++98 Standard?

Quotes from the Standard would be much appreciated.

UPDATE: based on this amazing paper by Hans Boehm on "benign" data races, a simple way to get bitten is for the compiler to cancel the two XORs from insert_data() and data_is_present() so that data_is_present() always returns true, e.g. if it finds a local code fragment like

insert_data(e, h, t);
if (data_is_present(e, h, t)) // optimized to true as if in single-threaded code
   read_and_process(e, h, t); // data race if other thread has written
Fresno answered 20/9, 2012 at 7:16 Comment(6)
Aside from anything else, each thread can have its own independent cache as far as the standard is concerned. If you don't use atomics or some other form of synchronization, then that cache is not guaranteed ever to be refreshed. So aside from whether you read consistent data (which is what the question is about), you might separately worry about how stale the data is that you're reading. Even when your hardware makes guarantees about cache coherence, the compiler can cache whatever it likes in registers.Plebeian
Oh, and the fields of HashEntry could be on separate cache lines, so another legal thing is that one is fresh and the other is stale, forever, and data_is_present always returns false. Obviously anything's legal because a data race is UB, I'm just giving examples of things that could actually happen with only a plausible change to the hardware guarantees you're used to, or only a plausible amount of optimization by the compiler. The standard does not assume x64 ;-)Plebeian
@SteveJessop data_is_present() always returning false is inefficient but not incorrect. Returning true when the data is actually absent would lead to a real bug. If one ran this code on an x64 platform with a compiler targeting it, how could that plausibly happen? I'm just asking because people using this type of code haven't really reported such behavior (of course, they might not have looked too hard).Fresno
I'm not sure whether it could plausibly happen on x64, but what you get in return for asking what the standard guarantees is a lot of people thinking about other platforms. IIRC ARM's 32bit multi-CPU architectures didn't generally have coherent caches. I don't know whether their 64 bit are the same or not, but the answer to your question "even on 64-bit platforms" for this issue might well be "yes". Presumably if nobody has spotted a bug in this code on their system, then either it works on their system (at least almost all the time) or they haven't tested it thoroughly.Plebeian
I think that atomic, sequenced, 64-bit ops on eventually-consistent caches would be enough to make this code work. So the issue really is whether (a) the hardware or (b) the compiler on a 64 bit system would fail to provide those properties. To which the answer is that it's allowed to fail (as Nicol says), it probably hasn't failed for the people using this code, and whether it fails or not might depend on implementation-dependent things such as optimization level or MSVC's special meaning for volatile (if you use volatile and MSVC).Plebeian
@SteveJessop I finally found a simple way to break it based on a paper by Hans Boehm. May need a Sufficiently Smart and Optimizing Compiler, and a very small ratio of hash entries to cores, but eventually you will get bitten.Fresno
The C++11 specification defines pretty much any attempt by one thread to read or write a memory location that another thread is writing to as undefined behavior (absent the use of atomics or mutexes to prevent read/writes from one thread while another thread is writing).

Individual compilers may make it safe, but the C++11 specification itself provides no such guarantee. Simultaneous reads are never a problem; the problem is a write in one thread concurrent with a read or write in another.

And how is this different from the old C++98 Standard?

The C++98/03 standard doesn't provide any coverage with regard to threading. As far as the C++98/03 memory model is concerned, threading is not a thing that can possibly happen.

Icky answered 20/9, 2012 at 7:29 Comment(5)
+1 for the UB. But on any decent modern 64-bit architecture/compiler, wouldn't reads/writes to a 64-bit variable be atomic? Since this code already exists in the real world, how likely are people to get bitten?Fresno
@rhalbersma: No, they won't be. While the architecture is capable of such atomic reads/writes, the compiler won't know that they're needed and so will only use them if they're optimum which, surprisingly, they aren't always.Muticous
@DavidSchwartz Tnx. So people have generally been lucky if they haven't been bitten by this? The above code is typically used in chess programs with 2-8 threads, and upward of 1M hash entries. I guess it's very unlikely that it would crash any program, but nice to know that it is not guaranteed (and never was, for that matter).Fresno
@rhalbersma: They may have been lucky. They may have inspected the assembly code. But this is very, very bad. The next version of the compiler or even a CPU upgrade might break this kind of code. If you absolutely need the performance, as demonstrated by benchmarks and the like, it may be your "least awful" choice. But nobody should start out doing things this way.Muticous
@rhalbersma - there are two issues that atomic types address: tearing (switching threads when only part of a value has been written or read) and visibility (ensuring that other threads see the changes made in the thread that writes a value). Even if 64-bit stores are guaranteed to require only one buss cycle, that doesn't address visibility. As others have said, the C++ standard is quite clear: writing to a location while another thread is reading, absent language level synchronization, produces undefined behavior.Elenaelenchus
I don't think it depends so much on the compiler as on the CPU (its instruction set) you are using. I wouldn't expect the assumption to be very portable.

Protege answered 20/9, 2012 at 7:21 Comment(9)
Quite the reverse, it depends entirely on the compiler. For example, even if the CPU has atomic 64-bit read/write operations, that won't help if the compiler doesn't always choose to use them. And even if the CPU doesn't have atomic 64-bit read/write operations, that won't hurt if the compiler realizes the code requires them and creates them using locks or some other mechanism.Muticous
He's talking about std::atomic, not CPU intrinsics.Nonunionism
@DavidSchwartz, If a CPU supports atomic 64-bit read/write and the compiler doesn't use it, well, that seems incorrect. Either way, expecting 64-bit read/write operations to be atomic without explicitly declaring them as such, or otherwise protecting them, isn't portable.Protege
@Brady: Why is that "incorrect" in a context where the compiler decides a non-atomic operation is better for some reason? (And there are such cases on x86!) What rule requires it to generate worse code than it's capable of generating?Muticous
@DavidSchwartz, I can't think of such a context (which doesn't mean much :) If the compiler correctly decides when it should be atomic, then you're correct. Just out of curiosity, can you give a reference to said case, please. Thanks!Protege
@Brady: Some common cases are 1) where the compiler thinks there will be a write most of the time and so writes the "unchanged" value back even where there shouldn't be a write to avoid a conditional branch and 2) where the compiler "can tell" that a write has no effect and so doesn't do the write. 3) where the compiler eliminates a write because it "can tell" that a subsequent write changes the value before the value is read. In all three of these cases, the compiler deduces something that would be true if there were no other threads and so you don't get the most natural atomic assembly.Muticous
@DavidSchwartz If the generated assembly code for both insert_entry() and data_is_present() really does use atomic 64-bit reads/writes, then for that platform everything is safe, correct?Fresno
@rhalbersma: If you hand inspect the generated assembly and you know the instruction set well enough to know it's safe, then that generated assembly is safe. However, the next version of the compiler or a change in the optimization flags might break things for you. It's much easier just to specify appropriate atomic operations, that way you're guaranteed to get the right generated assembly. (So while that would work, I can't imagine any situation where it would be your best choice.)Muticous
@DavidSchwartz Hence my original point: it's not portable unless you explicitly define the operations as atomic.Protege
The code is totally broken. The compiler has substantial freedom to reorder instructions when its analysis suggests the overall effect is identical. In insert_data, for example, there's no guarantee that key_xor_value will be updated before value, nor whether the updates are staged in temporary registers before being written back into the cache, let alone when those cache updates (whatever their "order" in the machine code or in the CPU's instruction pipeline) will be flushed from the updating core's, or cores' if the thread is context-switched mid-function, private caches to become visible to other threads. The compiler might even perform each update in two 32-bit steps, depending on the CPU, on whether you compile as 32-bit or 64-bit, on compilation options, etc.

Atomic operations tend to require something like CAS (Compare-and-Swap) style instructions, or volatile accesses combined with memory-barrier instructions, which synchronize data across the cores' caches and enforce some ordering.

Scot answered 20/9, 2012 at 8:46 Comment(1)
std::atomic does a bit more than that; it also guarantees that the compiler does not reorder reads and/or writes. The load/store operations take flags telling the compiler how strict you want it to be. The strictest form, full sequential consistency, leaves the least room for optimization but is usually the best starting point before reasoning any further about how to make the code more efficient. It's a pretty sweet addition to the standard; I threw my own abstraction for this stuff out the window early in 2015 (before that it had too many quirks with less mainstream tool chains).Gammadion
