Here is an article over at software.intel.com that sheds some light on user-level locks:
User level locks involve utilizing the atomic instructions of the
processor to atomically update a memory space. The atomic instructions
involve utilizing a lock prefix on the instruction and having the
destination operand assigned to a memory address. The following
instructions can run atomically with a lock prefix on current Intel
processors: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC,
NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. [...] For most instructions
a lock prefix must be explicitly used, except for the xchg instruction,
where the lock prefix is implied if the instruction involves a memory
address.
In the days of Intel 486 processors, the lock prefix used to assert a
lock on the bus, at a large cost in performance. Starting with
the Intel Pentium Pro architecture, the bus lock is transformed into a
cache lock. A lock will still be asserted on the bus in the most
modern architectures if the lock resides in uncacheable memory or if
the lock extends beyond a cache line boundary, splitting cache lines.
Both of these scenarios are unlikely, so most lock prefixes will be
transformed into a cache lock, which is much less expensive.
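
To see one of these lock-prefixed instructions come out of ordinary code, here is a minimal sketch (assuming GCC or Clang on x86-64; the __atomic builtin is theirs, the variable names are my own): the atomic fetch-and-add below compiles to a single lock xadd, one of the instructions from the list above.

    #include <stdio.h>

    static long counter = 0;

    long increment(void)
    {
        /* Atomically adds 1 and returns the previous value; on x86-64
         * this compiles to roughly: lock xadd %rax, counter(%rip) */
        return __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
    }

    int main(void)
    {
        increment();
        increment();
        printf("counter = %ld\n", counter); /* prints 2 */
        return 0;
    }
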
So what prevents another core from accessing the memory address? The cache coherency protocol already manages access rights for cache lines. So if a core has (temporarily) exclusive access rights to a cache line, no other core can access that cache line. To access that cache line, the other core has to obtain access rights first, and the protocol for obtaining those rights involves the current owner. In effect, the cache coherency protocol prevents other cores from accessing the cache line silently.
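
As an illustration of how this is used at user level, here is a sketch of a minimal test-and-set spinlock (again assuming GCC/Clang atomic builtins; this is a teaching sketch, not a production lock): the exchange compiles to xchg with a memory operand, whose lock prefix is implied as the quote notes, and every attempt forces the spinning core to obtain the cache line in exclusive state from the current owner.

    typedef struct { volatile int locked; } spinlock_t; /* 0 = free, 1 = held */

    static void spin_lock(spinlock_t *l)
    {
        /* Keep swapping 1 in; when the old value was 0, we took the lock.
         * On x86 this compiles to xchg, so the lock prefix is implied. */
        while (__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE) != 0)
            ; /* spin; a real lock would also issue a pause instruction here */
    }

    static void spin_unlock(spinlock_t *l)
    {
        /* A plain store with release ordering is enough to free the lock. */
        __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
    }
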
If the locked access is not confined to a single cache line, things get more complicated. There are all kinds of nasty corner cases, such as locked accesses that cross page boundaries. Intel does not disclose the details, and they probably use all kinds of tricks to make locks faster.
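
To make the cache line boundary case from the quote concrete, here is a sketch (assuming 64-byte cache lines, which current Intel parts use, and GCC/Clang builtins; the buffer and offsets are my own illustration): the first add stays entirely within one line and gets the cheap cache lock, while the second straddles a line boundary and takes the expensive split-lock path.

    #include <stdint.h>

    static char buf[128] __attribute__((aligned(64)));

    int main(void)
    {
        /* Entirely inside the first cache line: handled as a cache lock. */
        int64_t *aligned = (int64_t *)(buf + 0);

        /* Bytes 60..67 cross the boundary between the first and second
         * cache line: a locked access here is a split lock and may be
         * escalated to a bus lock (recent Linux kernels can even be
         * configured to trap such accesses). */
        int64_t *split = (int64_t *)(buf + 60);

        __atomic_fetch_add(aligned, 1, __ATOMIC_RELAXED);
        __atomic_fetch_add(split, 1, __ATOMIC_RELAXED);
        return 0;
    }
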