I need to read and write 16 bytes atomically. All writes go through cmpxchg16b, which is available on all x64 processors except, I think, some early AMD64 ones.
Now the question: for aligned 16-byte values that are only ever modified using lock cmpxchg16b (which acts like a full memory barrier), is it ever possible to read a 16-byte location and see half old data and half new data?
As long as I read with a single SSE instruction (so the thread cannot be interrupted in the middle of the read), I think it's impossible, even on multiprocessor NUMA systems, for the read to see inconsistent data. I think it must be atomic.
I am assuming that when cmpxchg16b executes, it modifies the 16 bytes atomically, rather than writing two 8-byte blocks with the potential for other threads to read in between (honestly, I don't see how it could work if it weren't atomic).
Am I right? If I'm wrong, is there a way to do an atomic 16 byte read without resorting to locking?
Note: There are a couple of similar questions here, but they don't deal with the case where the writes are done only with cmpxchg16b, so I feel this is a separate, unanswered question.
Edit: Actually, I think my reasoning was faulty. An SSE load instruction may be executed as two 64-bit reads, and it may be possible for another processor to execute the cmpxchg16b in between the two reads.
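
To make the "atomic 16-byte read without locking" part concrete, here is a minimal sketch of one option I'm aware of: do the read with cmpxchg16b as well, using the same value for the comparand and the replacement. If the location happens to hold that value, the same value is written back; otherwise the compare fails and the current contents come back in RDX:RAX. Either way the 16 bytes are obtained in one atomic operation. The names u128_t, cas16 and load16 are made up for illustration, and the sketch assumes GCC or Clang inline assembly on x86-64 and a 16-byte-aligned location.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-byte value type; cmpxchg16b requires 16-byte alignment. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} __attribute__((aligned(16))) u128_t;

/*
 * Atomically compare *p with *expected. If they are equal, store desired
 * and return 1. Otherwise leave *p unchanged, copy its current contents
 * into *expected, and return 0.
 */
static inline int cas16(u128_t *p, u128_t *expected, u128_t desired)
{
    unsigned char ok;
    uint64_t lo = expected->lo;
    uint64_t hi = expected->hi;

    /* A misaligned operand would raise a general-protection fault. */
    assert(((uintptr_t)p & 15u) == 0);

    __asm__ __volatile__(
        "lock cmpxchg16b %1\n\t"   /* compares RDX:RAX with *p, may store RCX:RBX */
        "setz %0"                  /* ZF is set when the exchange happened */
        : "=q"(ok), "+m"(*p), "+a"(lo), "+d"(hi)
        : "b"(desired.lo), "c"(desired.hi)
        : "memory", "cc");

    expected->lo = lo;
    expected->hi = hi;
    return ok;
}

/*
 * Atomic 16-byte load built on cas16. The comparand and the replacement
 * are both zero: if *p is zero, zero is written back (no visible change);
 * if not, the compare fails and the current value is returned.
 */
static inline u128_t load16(u128_t *p)
{
    u128_t expected = {0, 0};
    cas16(p, &expected, expected);
    return expected;
}
```

A reader would call load16(&shared) to get a consistent snapshot, and a writer would loop on cas16(&shared, &seen, next) until it returns 1. The trade-off is that cmpxchg16b always issues a write to the destination, so this kind of read needs writable memory and contends for the cache line like a write does.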