What does the "lock" instruction mean in x86 assembly?

I saw some x86 assembly in Qt's source:

q_atomic_increment:
    movl 4(%esp), %ecx
    lock 
    incl (%ecx)
    mov $0,%eax
    setne %al
    ret

    .align 4,0x90
    .type q_atomic_increment,@function
    .size   q_atomic_increment,.-q_atomic_increment
  1. From Googling, I learned that the lock instruction causes the CPU to lock the bus, but when does the CPU free the bus?

  2. Regarding the code as a whole, I don't understand how it implements the add.

Hebraism answered 17/1, 2012 at 7:33 Comment(2)
Please see https://mcmap.net/q/14943/-x86-lock-question-on-multi-core-cpus – Innervate
Related: my answer on Can num++ be atomic for 'int num'? explains atomicity on x86, what exactly the lock prefix does, and what would happen without it. – Penurious
  1. LOCK is not an instruction itself: it is an instruction prefix, which applies to the following instruction. That instruction must be something that does a read-modify-write on memory (INC, XCHG, CMPXCHG etc.) --- in this case it is the incl (%ecx) instruction which increments the long word at the address held in the ecx register.

    The LOCK prefix ensures that the CPU has exclusive ownership of the appropriate cache line for the duration of the operation, and provides certain additional ordering guarantees. This may be achieved by asserting a bus lock, but the CPU will avoid this where possible. If the bus is locked then it is only for the duration of the locked instruction.

  2. This code copies the address of the variable to be incremented off the stack into the ecx register, then does lock incl (%ecx) to atomically increment that variable by 1. The next two instructions set the eax register (which holds the function's return value) to 0 if the new value of the variable is 0, and to 1 otherwise. The operation is an increment, not an add (hence the name); a rough C++ equivalent is sketched below.
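
For comparison, here is a rough C++ sketch of what q_atomic_increment boils down to, written with GCC's __atomic_add_fetch builtin rather than Qt's actual code (the function name and the choice of builtin are assumptions made for illustration):

// Sketch only: not Qt's implementation, just the same observable behavior.
// Assumes a GCC/Clang-style compiler that provides the __atomic builtins.
bool q_atomic_increment_sketch(int *value) {
    // Atomic read-modify-write; on x86 this compiles to a lock-prefixed instruction.
    int new_value = __atomic_add_fetch(value, 1, __ATOMIC_SEQ_CST);
    // Mirrors the mov $0,%eax / setne %al pair: report whether the new value is non-zero.
    return new_value != 0;
}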

Press answered 17/1, 2012 at 8:46 Comment(6)
So the instruction "mov $0,%eax" seems redundant? – Hebraism
@gemfield: No, the MOV sets all of EAX to zero. SETNE only changes the low byte. Without the MOV, the 3 high bytes of EAX would contain random leftover values from previous operations, so the return value would be incorrect. – Press
In a Russian book ("Assembler for DOS, Windows и Linux", 2000, Sergei Zukkov) the author says the following about this prefix: "For the entire duration of a command carrying this prefix, the data bus is locked, and if the system has another processor, it cannot access memory until the command with the LOCK prefix completes. The XCHG command is always automatically performed with memory access locked, even if the LOCK prefix is not specified. This prefix can be used only with the commands ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD and XCHG." – Formerly
@bruziuz: modern CPUs are much more efficient: if the data for a locked instruction doesn't cross a cache line, a CPU core can just internally lock that cache line instead of blocking all loads/stores from all other cores. See also my answer on Can num++ be atomic for 'int num'? for more details of how this works to make it appear atomic to possible observers using the MESI cache-coherency protocol. – Penurious
Thank you very much! Cool! :) – Formerly
See also LOCK prefix vs MESI protocol? re: the MESI cache-lock part specifically, for a plain add vs. lock add. (Cache-line-split locked operations are disastrously expensive, affecting memory access by all other cores. So don't do that.) – Penurious

Incrementing a value in memory is a read-modify-write: the old value has to be read in first.

The lock prefix forces the multiple micro-operations that actually occur to appear to execute atomically.

If you have 2 threads each trying to increment the same variable, and both read the same original value at the same time, then both compute the same incremented value and both write that same value back.

Instead of the variable being incremented twice, which is what you would expect, it ends up incremented only once.

The lock prefix prevents this lost update, as the sketch below illustrates.
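
A minimal sketch of that lost update in C++-like terms (the helper function is hypothetical; a real compiler may emit a single non-locked instruction, which is still not atomic):

// What a plain, non-locked increment of a shared counter amounts to.
int increment_not_atomic(int *counter) {
    int tmp = *counter;   // 1. read the old value
    tmp = tmp + 1;        // 2. modify it in a register
    *counter = tmp;       // 3. write it back
    return tmp;
}
// If two threads both perform step 1 before either performs step 3, both
// write back old+1 and one increment is lost. The lock prefix makes the
// whole read-modify-write a single indivisible step.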

Favor answered 23/3, 2013 at 23:36 Comment(0)

From Googling, I learned that the lock instruction causes the CPU to lock the bus, but when does the CPU free the bus?

LOCK is an instruction prefix, so it applies only to the following instruction. The source doesn't make it very clear here, but the real instruction is LOCK INC. So the bus is locked only for the duration of the increment, then freed.

Regarding the code as a whole, I don't understand how it implements the add.

They don't implement an add; they implement an increment, along with returning an indication of whether the new value is non-zero. An atomic addition would use LOCK XADD (Windows InterlockedIncrement/Decrement are, however, also implemented with LOCK XADD).
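
As an illustration of LOCK XADD, an atomic add that also returns the old value could be sketched with GCC-style inline assembly on x86-64 like this (the function name is made up; this is not how Windows or Qt actually implement their primitives):

// Sketch: atomic fetch-and-add via lock xadd (x86-64, GCC/Clang inline asm).
long fetch_add_sketch(long *ptr, long value) {
    __asm__ __volatile__(
        "lock; xaddq %0, %1"   // atomically: old = *ptr; *ptr += value; %0 = old
        : "+r" (value), "+m" (*ptr)
        :
        : "memory", "cc"
    );
    return value;              // the value *ptr held before the addition
}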

Propagate answered 17/1, 2012 at 7:47 Comment(7)
Thanks! Then which register stores the return value of the function q_atomic_increment? – Hebraism
Return values are stored in %eax. – Novelist
So the code "return q_atomic_increment(&_q_value) != 0" tests whether %eax is not equal to zero? – Hebraism
@gemfield: it's zeroed, then the low byte is set via SETNE using the condition flags from INC. – Propagate
Is it whether the old value was 0 or not that's reported in %eax, or the new value? – Proffer
@Proffer It's the new value indeed; INC sets the zero flag according to its result, not according to its source operand. See: c9x.me/x86/html/file_module_x86_id_140.html – Dialectal
Locking the bus was done on the 486/Pentium. This is very inefficient because it creates a huge contention point and reduces performance of the system (Amdahl's law). That is why they switched to locking at the cache-line level in most cases. For more information see: intel.com/content/www/us/en/architecture-and-technology/… Section 8.1.4 – Pericarditis

Minimal runnable C++ threads + LOCK inline assembly example

main.cpp

#include <atomic>
#include <cassert>
#include <iostream>
#include <thread>
#include <vector>

std::atomic_ulong my_atomic_ulong(0);
unsigned long my_non_atomic_ulong = 0;
unsigned long my_arch_atomic_ulong = 0;
unsigned long my_arch_non_atomic_ulong = 0;
size_t niters;

void threadMain() {
    for (size_t i = 0; i < niters; ++i) {
        my_atomic_ulong++;      // C++11 atomic increment
        my_non_atomic_ulong++;  // plain increment: racy
        // Inline assembly increment without the lock prefix: also racy.
        __asm__ __volatile__ (
            "incq %0;"
            : "+m" (my_arch_non_atomic_ulong)
            :
            :
        );
        // Inline assembly increment with the lock prefix: atomic read-modify-write.
        __asm__ __volatile__ (
            "lock;"
            "incq %0;"
            : "+m" (my_arch_atomic_ulong)
            :
            :
        );
    }
}

int main(int argc, char **argv) {
    size_t nthreads;
    if (argc > 1) {
        nthreads = std::stoull(argv[1], NULL, 0);
    } else {
        nthreads = 2;
    }
    if (argc > 2) {
        niters = std::stoull(argv[2], NULL, 0);
    } else {
        niters = 10000;
    }
    std::vector<std::thread> threads(nthreads);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i] = std::thread(threadMain);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i].join();
    assert(my_atomic_ulong.load() == nthreads * niters);
    assert(my_atomic_ulong == my_atomic_ulong.load());
    std::cout << "my_non_atomic_ulong " << my_non_atomic_ulong << std::endl;
    assert(my_arch_atomic_ulong == nthreads * niters);
    std::cout << "my_arch_non_atomic_ulong " << my_arch_non_atomic_ulong << std::endl;
}

GitHub upstream.

Compile and run:

g++ -ggdb3 -O0 -std=c++11 -Wall -Wextra -pedantic -o main.out main.cpp -pthread
./main.out 2 10000

Possible output:

my_non_atomic_ulong 15264
my_arch_non_atomic_ulong 15267

From this we see that the LOCK prefix made the addition atomic: without it we get race conditions on many of the increments, and the final count is less than the expected 20000 (2 threads × 10000 iterations), while the atomic counters, verified by the asserts, reach exactly 20000.

The LOCK prefix is what compilers use to implement C++11 std::atomic read-modify-write operations (and, more generally, user-space synchronization primitives) on x86.

See also: What does multicore assembly language look like?

Tested on Ubuntu 19.04 amd64.
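
For reference, with optimization enabled the two increments in threadMain typically compile to something like the assembly shown in the comments below; treat the exact instructions as an assumption, since output varies by compiler and flags:

#include <atomic>

std::atomic_ulong atomic_counter(0);
unsigned long plain_counter = 0;

void bump() {
    // Usually a single locked read-modify-write, e.g.  lock addq $1, atomic_counter(%rip)
    atomic_counter.fetch_add(1, std::memory_order_seq_cst);
    // Usually a plain RMW, e.g.  addq $1, plain_counter(%rip)  -- not atomic across cores.
    plain_counter++;
}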

Benia answered 28/6, 2019 at 9:0 Comment(3)
What's the point of using -O0, and fencing the non-atomic increment with a full barrier (lock inc)? To prove that it's still broken even in the best-case scenario? You'd see many more lost counts if you let non-locked inc forward from the store buffer. – Penurious
@PeterCordes -O0: I hadn't put much thought into it, it's done by default for better debugging, although I later noticed that it does make it a bit easier to see the behavior in such a simple case because -O3 optimizes the loop to a single add. "and fencing the non-atomic increment with a full barrier": does LOCK also affect the non-atomic variables in the above program? – Benia
lock inc is a full barrier, like mfence. You don't have 4 separate loops, you interleave increments. It doesn't make the other inc atomic, but it forces the inc's store to be globally visible before the next inc's load, so yes, it affects it significantly. If you don't want -O3 to hoist out of the loop and do += N, you can use volatile; constraining code-gen without giving any kind of atomicity is what volatile is for. – Penurious
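
A sketch of that volatile suggestion (names are hypothetical, same structure as threadMain above): volatile forces one load and one store per iteration, so -O3 cannot collapse the loop into a single += niters, but the increment is still a racy read-modify-write:

#include <cstddef>

volatile unsigned long my_volatile_ulong = 0;  // no atomicity, only constrained code-gen

void threadMainVolatile(std::size_t iters) {
    for (std::size_t i = 0; i < iters; ++i)
        my_volatile_ulong = my_volatile_ulong + 1;  // still not atomic across threads
}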
