When I asked this question I was confused...
When I originally asked this question, I was confused because I had no conceputal understanding of how a "mutex" functions in hardware, whereas I did have a conceptual understanding of many other things that exist in hardware. (For example, how a compiler converts text into machine readable instructions. How cache and memory work. How graphics or coprocessors work. How network hardware and interfaces work, etc.)
Misconception 1: Mutex does not lock memory locations
When I first heard about Mutex, long before writing this question, I misunderstood a mutex to be a feature which locks regions of memory. (That region might be global.)
This is not what happens. Other threads and processes can continue to access main memory and cache if another thread locks a mutex. You can see immediatly why such a design would be inefficient, since it would block all other system processes, for the sake of synchronizing one.
Misconception 2: The scope in which a mutex object is declared is irrelevant
The context of this is C code, and C like languages where you have scoped blocks defined by {
and }
however the same logic could apply to Python where scope is defined by indentation.
I believe that this misunderstanding came from the existance of scoped_lock
objects, and similar concepts where scope is used to manage the lifetime (locking and unlocking, resources) of a Mutex object.
One could also argue that since pointers and references to a Mutex can be passed around a program, the scope of a Mutex couldn't be used to define what variables are "locked" by a mutex.
For example, I misunderstood the following snippet:
{
int x, y, z;
Mutex m;
m.lock();
}
I believed that the above snippet would lock access to variables x, y and z from all other threads because x, y and z are declared in the same scope as the mutex m. This is also not how a mutex works.
Understanding 1: Mutex is typically implemented in hardware using atomic operations
Atomic operations are completely seperate from the concept of mutex, however they are a prerequisite to understanding how a mutex can exist, and how it can work.
When a CPU executes something like c = a + b
, this involves a sequence of individual (atomic) operations. The word Atom is derived from Atomos meaning "indivisible", or "fundamental". (Atoms are divisible, but when theorists of Ancient Greece originally concieved of the objects from which matter was composed, they assumed that particles must be divisible down to some fundamental smallest possible component, which itself is indivisible. They were not too far wrong, since an atom is made from other fundamental particles which so far we understand to be indivisible.)
Returning to the point: c = a + b
is something like the following:
- load a from memory into register 1
- load b from memory into register 2
- do operation add: add contents of register 2 to register 1, result is in register 1
- save register 1 to memory c
The add operation might take several clock cycles, and loading/saving to memory takes typically of order 100 clock cycles on modern x86 machines. However each operation is atomic in the sense that a single CPU instruction is being completed, and this instruction cannot be divided into any smaller step of smaller instructions. The instructions are themselves fundamental computing operations.
With that understood, there exists a set of atomic instructions which can do things such as:
- load a value from memory increment it and save it to memory
- load a value from memory decrement it and save it to memory
- load a value from memory, compare it to a value which is already loaded into a register, and branch depending on the comparison result
Note that such operations are typically significantly slower than their non-atomic sequence counterparts. This is because optimizations such as pipelining are forfit when executing the above instructions. (I think?)
At this point my knowledge becomes a bit less accurate and more hand-wavey, but as far as I understand, these operations are typically implemented by having some digital logic inside the processor which blocks all other processes from running while these atomic operations (listed above) are executing.
Meaning: If there are 8 CPU cores running, if one core encounters an instruction like the above, it signals the other cores to stop running until it has finished that atomic operation. (It is at least something approximatly along these lines.)
Understanding 2: Actual mutex operation
Given the above, it is possible to implement a mutex using these atomic machine instructions. Other answers posted here suggest possible ways of doing it including something similar to reference counting. (Semaphore.)
How an acutal mutex in C++ works is this:
- Each mutex object has a variable in memory associated with it, the value of this variable indicates whether a mutex is locked or not
- This mutex variable is updated using the special atomic operations that a CPU supports for the purpose of allowing a mutex to be programmed
- Elsewhere in memory there are some other variables/data which you want to protect/synchronize access to
- This synchronization is done using the mutex variable/data
- Before a thread reads/writes to some data/variable which needs to be accessed mutually exclusively by all threads which operate on it, that thread must first "lock" the special mutex data/variable
- This is done using the atomic operations built into a CPU for the purpose of supporting mutex programming
So you see, the data which is "locked" and accessed mutually exclusively is entirely independent from the actual data used to store the state of the mutex.
- If another thread wants to read/write the data which must be accessed mutually exclusively, it will try to lock the mutex. If the mutex is already locked, that means another thread has the right to access this data, and no other thread is permitted to, therefore this thread will typically go to sleep, and will be re-woken by the operating system when the mutex is next unlocked.
It is important to note the operating system thread (kernel) is critically involved in the mutex process. Typically, before a thread sleeps, it will tell the operating sytem that it wishes to be woken up again when the mutex is free. The operating system is also notified when other threads lock or unlock a mutex. Hence synchronization of information about the state of a mutex is passed via messages through the operating system kernel.
This is why writing a multiple thread OS kernel is (proabably) impossible (if not very difficult). I don't know if this has actually been done successfully. It sounds like a difficult problem which might be the subject of current CS research.
This is pretty much everything I know about the subject. Obviously my knowledge is not absolute...
Note: Feel free to correct my Greek history or x86 Machine Instruction knowledge in the comments section. No doubt not everything here is perfectly accurate.