On Intel, the arguments to CMPXCHG do NOT need to be cache-line aligned. Try it; you will see that it works.
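If you want to try it, here is a minimal C11 sketch. It does a compare-and-swap on a field that sits 4 bytes into a cache line, so the field is naturally aligned but not cache-line aligned. The 64-byte line size, the `Packed` struct, and the helper names are my own assumptions for illustration. On x86, compilers typically lower `atomic_compare_exchange_strong` to a LOCK CMPXCHG, but check the generated assembly if it matters to you.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumption: 64-byte cache lines */

/* Hypothetical layout: the struct starts on a line boundary,
 * and the counter sits 4 bytes into that line. */
struct Packed {
    alignas(CACHE_LINE) int32_t pad; /* offset 0 */
    _Atomic int32_t counter;         /* offset 4 on typical ABIs */
};

/* True only if the counter's address is a multiple of the line size. */
bool counter_is_line_aligned(const struct Packed *p)
{
    return ((uintptr_t)&p->counter % CACHE_LINE) == 0;
}

/* Compare-and-swap: returns true if counter went from expected to desired. */
bool cas_counter(struct Packed *p, int32_t expected, int32_t desired)
{
    return atomic_compare_exchange_strong(&p->counter, &expected, desired);
}
```

Calling `cas_counter(&p, 5, 7)` on a `struct Packed` whose counter holds 5 should succeed even though `counter_is_line_aligned(&p)` is false.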
But you are correct: in cacheable memory, Intel does use the cache protocol to implement CMPXCHG. So you would be smart not to put two independent, heavily used synchronization variables in the same cache line, because if two processors were synchronizing using these different variables, the cache line might thrash back and forth between them. But this is exactly the same issue as for any data: you don't want different processors writing to the same cache line at the same time. That is false sharing.
But you certainly can have locks that are not cache-line aligned:
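For the case where you do want to keep two hot variables apart, a sketch of the usual padding trick in C11 follows. The 64-byte line size and the names `HotCounters`, `a`, and `b` are assumptions of mine, not anything from the question.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumption: 64-byte cache lines */

/* Hypothetical pair of heavily used, independent variables.
 * alignas forces each one to start on its own cache line. */
struct HotCounters {
    alignas(CACHE_LINE) atomic_int a;
    alignas(CACHE_LINE) atomic_int b;
};

/* Index of the cache line, relative to the start of the struct,
 * that contains the given byte offset. */
size_t line_of(size_t offset)
{
    return offset / CACHE_LINE;
}

/* True if a and b would land in the same cache line. */
bool counters_share_line(void)
{
    return line_of(offsetof(struct HotCounters, a)) ==
           line_of(offsetof(struct HotCounters, b));
}
```

The cost is memory: with these assumptions the struct grows to two full lines, which is only worth paying when both variables really are contended.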
struct Foo {
    int data;
    Lock lock;
    int data_after;
};
You can put different locks in the same cacheline:
struct Foo {
    int data;
    Lock read_lock;
    int data_between;
    Lock write_lock;
    int data_after;
};
Since reading and writing tend to be mutually exclusive, there may be no loss from sharing the line.
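To make the struct above concrete, here is a sketch with one possible `Lock`. The `Lock` type in the structs is left undefined, so the `atomic_flag` spinlock and the `lock_*` helper names below are purely my assumption, as is the 64-byte line size.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumption: 64-byte cache lines */

/* Hypothetical Lock: a minimal test-and-set spinlock. */
typedef struct {
    atomic_flag held;
} Lock;

void lock_init(Lock *l)
{
    atomic_flag_clear(&l->held);
}

/* Returns true if the lock was free and is now held by the caller. */
bool lock_try(Lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void lock_release(Lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct Foo {
    int data;
    Lock read_lock;
    int data_between;
    Lock write_lock;
    int data_after;
};

/* Byte distance between the two locks inside struct Foo. When this is
 * smaller than CACHE_LINE the locks can share a line, and will unless
 * the object happens to straddle a line boundary. */
size_t foo_lock_distance(void)
{
    return offsetof(struct Foo, write_lock) - offsetof(struct Foo, read_lock);
}
```

Both locks work independently even though nothing here is cache-line aligned. Alignment only affects how much the line bounces between processors, not correctness.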
By the way, in uncached memory Intel does not use the cache snooping protocol for atomic operations like CMPXCHG. So there is less reason to cache line align synchronization variables. But you still may want to: many memory subsystems interleave by cacheline size, even when uncached.
And as for ARM: it is pretty much the same.
On a snoopy bus, or uncached, you may not need to worry too much about cache line alignment.
But in a clustered cache hierarchy, you have exactly the same issues as on x86. More so, in fact: it is well known how to "export" operations like CMPXCHG, but not ARM's ldrexd/strexd.
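For comparison, here is a sketch of a portable compare-and-swap retry loop in C11. The `fetch_max` helper is a made-up example of mine. As far as I know, compilers typically turn the exchange into CMPXCHG on x86 and into a load-exclusive/store-exclusive loop on ARM cores without a native CAS instruction, which is why the weak form is allowed to fail spuriously.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical helper: atomically raise *target to at least candidate.
 * Returns the value observed just before the update (or the current
 * value, if it was already >= candidate). */
int fetch_max(atomic_int *target, int candidate)
{
    int seen = atomic_load_explicit(target, memory_order_relaxed);

    /* On failure the exchange reloads `seen` with the current value,
     * so each retry re-checks against fresh data. */
    while (seen < candidate &&
           !atomic_compare_exchange_weak_explicit(target, &seen, candidate,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
        /* retry: lost a race or failed spuriously */
    }
    return seen;
}
```

Either way the whole cache line holding `*target` is what the hardware moves between processors, so the same false-sharing considerations apply on both architectures.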