Settle in, kids, this is going to be a long one.
First, let's discuss CAS (compare-and-swap). It is not a synchronization mechanism. It is an atomic operation that lets us update a value in main memory while simultaneously checking that the value has not changed (that is, that it is still what we expect it to be). There is no locking involved, although some synchronization primitives (semaphores, mutexes) are built on top of it. Let's take a look at the following example:
a = 1;
--------------------------------
Thread 1 | Thread 2
b = 1 + a | b = 2 + a
cas(*a, 1, b ) | cas(*a, 1, b )
Now one of the CAS operations will fail, meaning it will return false. The other will return true, and the value that the pointer *a refers to will be updated with the new value. If we didn't use CAS but instead just updated the value, like this:
a = 1;
--------------------------------
Thread 1 | Thread 2
b = 1 + a | b = 2 + a
a = b | a = b
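At the end of this computation a could be 2 or 3, and both threads would complete happily without knowing which value was actually saved in a. This is what is called a data race, and CAS is one way to solve it: the thread whose CAS fails learns that a changed underneath it and can re-read and retry.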
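The two-thread CAS scenario above can be sketched in Java roughly as follows. This is only an illustration: the class name CasRaceDemo and its methods are made up for this example, AtomicInteger.compareAndSet stands in for the cas(*a, 1, b) pseudocode, and a CyclicBarrier is used (an assumption of the sketch, not something the scenario requires) to force both threads to read a before either one attempts the swap.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class: two threads race to CAS the same variable.
class CasRaceDemo {

    // Runs the scenario once and returns how many CAS attempts succeeded.
    static int successfulSwaps(AtomicInteger a) {
        CyclicBarrier bothHaveRead = new CyclicBarrier(2);
        AtomicInteger successes = new AtomicInteger();

        Thread t1 = new Thread(() -> attempt(a, 1, bothHaveRead, successes)); // b = 1 + a
        Thread t2 = new Thread(() -> attempt(a, 2, bothHaveRead, successes)); // b = 2 + a
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return successes.get();
    }

    private static void attempt(AtomicInteger a, int delta,
                                CyclicBarrier bothHaveRead, AtomicInteger successes) {
        try {
            int expected = a.get();   // both threads read the same starting value
            int b = delta + expected;
            bothHaveRead.await();     // wait until the other thread has read too
            // Succeeds only if a still holds the value this thread read earlier.
            if (a.compareAndSet(expected, b)) {
                successes.incrementAndGet();
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AtomicInteger a = new AtomicInteger(1);
        int wins = successfulSwaps(a);
        System.out.println("successful CAS calls: " + wins + ", a = " + a.get());
    }
}
```

Exactly one compareAndSet returns true; which thread wins (and so whether a ends up as 2 or 3) depends on scheduling, but the losing thread is told that it lost.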
The existence of CAS enables us to write lock-free algorithms (no locking needed), such as some of the collections in the java.util.concurrent package, which can be accessed concurrently without being synchronized.
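The basic building block of most lock-free code is a CAS retry loop: read, compute, try to swap, and start over if somebody else got there first. Here is a minimal sketch of that pattern; CasCounter and runThreads are hypothetical names for this example, and this is not how any particular java.util.concurrent class is actually implemented.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical lock-free counter built on a CAS retry loop.
class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    int increment() {
        while (true) {
            int current = value.get();  // read the current value
            int next = current + 1;     // compute the new value
            // Publish only if nobody changed the value in between; otherwise retry.
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    int get() {
        return value.get();
    }

    // Increments one shared counter from several threads and returns the total.
    static int runThreads(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runThreads(4, 10_000));
    }
}
```

No update is lost, because a failed CAS never overwrites anything; the thread simply re-reads and tries again.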
Now, I mentioned that CAS is used to implement synchronization. That is why the cost of acquiring a lock and the cost of performing a CAS are almost the same (as long as there is no contention!). In that sense you get hardware support for the synchronized keyword.
synchronized(this){
n = n + 1;
}
AtomicLong al = new AtomicLong();
al.updateAndGet(n -> n + 1);
The performance difference you might see between CAS and synchronized comes from what happens under contention: when your CAS fails you can just retry, whereas with synchronized the thread might be put to sleep by the OS, which takes you down the rabbit hole of context switches (which might or might not happen :) depending on the OS).
Now for notify(), notifyAll() and wait(). These call directly into the thread scheduler that is part of the OS. The scheduler has two queues: the wait queue and the run queue. When a thread invokes wait(), that thread is placed in the wait queue and sits there until it gets notified and is placed in the run queue to be executed as soon as possible.
In Java there are two basic kinds of thread synchronization: one via wait() and notify(), called cooperation, and the other via locks, called mutual exclusion (mutex). These are generally two parallel tracks for doing things at once.
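To make the cooperation track concrete, here is a small sketch of a one-slot mailbox built with wait() and notifyAll(). The Mailbox class and its demo method are invented for this example. Note that wait() and notifyAll() must be called while holding the object's monitor, and that wait() sits in a loop because a thread can wake up without its condition being true.

```java
// Hypothetical one-slot mailbox: a producer and a consumer cooperate via wait/notifyAll.
class Mailbox {
    private String message; // guarded by this; null means the slot is empty

    synchronized void put(String m) throws InterruptedException {
        while (message != null) {
            wait();          // slot is full: release the monitor and wait
        }
        message = m;
        notifyAll();         // wake up anyone waiting in take()
    }

    synchronized String take() throws InterruptedException {
        while (message == null) {
            wait();          // slot is empty: release the monitor and wait
        }
        String m = message;
        message = null;
        notifyAll();         // wake up anyone waiting in put()
        return m;
    }

    // Starts a producer thread and consumes its message on the calling thread.
    static String demo() {
        Mailbox box = new Mailbox();
        Thread producer = new Thread(() -> {
            try {
                box.put("hello");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            String received = box.take();
            producer.join();
            return received;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Both tracks show up here at once: synchronized gives the mutual exclusion around message, and wait()/notifyAll() give the cooperation between the two threads.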
Now, I don't know how synchronization was done before Java 5. But now there are two ways to synchronize on an object (one of them might be old, the other new).
Biased lock. The thread id is put in the object header, and then when that same thread wants to lock or unlock that object, the operation costs us nothing. This is why, if our app has a lot of uncontended locks, this can give us a significant performance boost, as we can avoid the second path:
(This is probably the old one.) Using monitorenter/monitorexit. These are bytecode instructions that are placed on entry to and exit from the synchronized {...} statement. This is where the object identity becomes relevant, as it becomes part of the lock information.
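A small sketch of why object identity matters: the monitor belongs to one specific object, so holding the lock on one object says nothing about another. MonitorIdentityDemo is a made-up class for this example; Thread.holdsLock is a standard JDK method that reports whether the current thread holds the monitor of the given object.

```java
// Hypothetical demo: a monitor is tied to the identity of the object it is taken on.
class MonitorIdentityDemo {

    // Returns {holds lockA inside the block, holds lockB inside the block, holds lockA after the block}.
    static boolean[] check() {
        Object lockA = new Object();
        Object lockB = new Object();
        boolean[] result = new boolean[3];

        synchronized (lockA) {                     // monitorenter on lockA
            result[0] = Thread.holdsLock(lockA);   // this thread owns lockA's monitor
            result[1] = Thread.holdsLock(lockB);   // lockB is a different object, so not held
        }                                          // monitorexit on lockA
        result[2] = Thread.holdsLock(lockA);       // monitor released after the block

        return result;
    }

    public static void main(String[] args) {
        boolean[] r = check();
        System.out.println(r[0] + " " + r[1] + " " + r[2]);
    }
}
```

If you compile this and run javap -c on the class, you should be able to spot the monitorenter and monitorexit instructions around the body of the synchronized block.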
OK, that's it. I know I didn't answer the question fully. The subject is so complicated and so difficult. Chapter 17 of the "Java Language Specification" ("Threads and Locks", which contains the Java Memory Model) is probably the only one that can't be read by regular programmers (maybe dynamic dispatch also falls under that category :)). My hope is that at least you will be able to google the correct words.
A couple of links:
https://www.artima.com/insidejvm/ed2/threadsynchP.html (monitorenter/monitorexit, explanation)
https://www.ibm.com/developerworks/library/j-jtp10185/index.html (how locks are optimized inside the JVM)