After some research I found that if two cores hold the same cache line and
one of them modifies it, then the second one has to reread the entire line from main memory. https://en.wikipedia.org/wiki/MESI_protocol.
This is not correct. The caches are the source of truth, because caches (at least on x86) are always coherent. So in theory a cache line that is already held by some CPU cache never needs to be reread from main memory; it can always be served from one of the CPU caches. If a different CPU needs the cache line, it can read the value from another cache. With MESI, a cache line can be flushed to main memory when it is in the modified state and a different CPU wants to read it; otherwise no communication with main memory is needed. This is because MESI doesn't support dirty sharing; MOESI solves that problem.
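To make the state transitions concrete, here is a minimal sketch of a single cache line shared by two cores under textbook MESI. The class `MesiSketch` and its counters are made up for illustration, and it simplifies heavily (two cores, one line, no bus model); it is not how any particular CPU implements the protocol. It only shows that main memory gets involved on the first fill and when a modified line is requested by the other core.

```java
public class MesiSketch {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    // State of the one cache line in core 0 and core 1 (hypothetical model).
    final State[] state = { State.INVALID, State.INVALID };
    int memoryReads = 0; // fills served by main memory
    int writeBacks = 0;  // dirty lines flushed to main memory

    void read(int core) {
        int other = 1 - core;
        if (state[core] != State.INVALID) {
            return; // cache hit, no traffic
        }
        switch (state[other]) {
            case MODIFIED:
                writeBacks++; // plain MESI has no dirty sharing, so flush first
                state[other] = State.SHARED;
                state[core] = State.SHARED;
                break;
            case EXCLUSIVE:
            case SHARED:
                // Simplification: the other cache supplies the line.
                state[other] = State.SHARED;
                state[core] = State.SHARED;
                break;
            default:
                memoryReads++; // nobody has the line yet
                state[core] = State.EXCLUSIVE;
        }
    }

    void write(int core) {
        int other = 1 - core;
        if (state[core] == State.INVALID) {
            if (state[other] == State.MODIFIED) {
                writeBacks++; // textbook MESI flushes the dirty copy
            } else if (state[other] == State.INVALID) {
                memoryReads++; // line has to be fetched first
            }
        }
        state[other] = State.INVALID; // request for ownership invalidates the other copy
        state[core] = State.MODIFIED;
    }

    public static void main(String[] args) {
        MesiSketch line = new MesiSketch();
        line.read(0);  // core 0: INVALID -> EXCLUSIVE (one memory read)
        line.write(0); // core 0: EXCLUSIVE -> MODIFIED (no memory traffic)
        line.read(1);  // core 0 flushes, both cores end up SHARED
        System.out.println(line.state[0] + " " + line.state[1]
                + " memoryReads=" + line.memoryReads
                + " writeBacks=" + line.writeBacks);
    }
}
```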
But it is still unclear to me why the hardware forces the CPU to reread it.
I mean, that is why we have the volatile keyword in Java, right?
Caches on x86 are always coherent. No special CPU instructions are needed for this; it is out-of-the-box behavior. So it can't happen that, e.g., the value A=1 is written to some cache line while a later read still sees the old value A=0.
If a variable is declared as volatile, then threads will skip the cache for this variable
and always read/write it from/to main memory.
If the hardware forces the CPU to reread cache lines after every write, then how is data inconsistency possible in multithreaded applications?
This is not correct. Caches are the source of truth; there is no 'forced reading from main memory'. There are special instructions that can bypass the CPU caches, called non-temporal loads and stores, but they are not relevant for this discussion.
The purpose of volatile is to make sure that the ordering with respect to loads and stores to other addresses is preserved and that stores become visible to other threads.
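A minimal sketch of that ordering guarantee, assuming Java 9 or later (for `Thread.onSpinWait`). The class `VolatilePublish` and its fields are made up for illustration. The writer does a plain store to `payload` followed by a volatile store to `ready`; because the plain store cannot be reordered after the volatile store, a reader that observes `ready == true` is guaranteed by the Java Memory Model to also observe the payload.

```java
public class VolatilePublish {
    static int payload = 0;                 // plain field (hypothetical)
    static volatile boolean ready = false;  // volatile flag (hypothetical)

    static void writer() {
        payload = 42;  // plain store
        ready = true;  // volatile store: the store above cannot move below it
    }

    static int reader() {
        while (!ready) {          // volatile load
            Thread.onSpinWait();  // spin hint, Java 9+
        }
        return payload;           // ordered after the volatile load
    }

    static int runOnce() {
        payload = 0;
        ready = false;
        int[] seen = new int[1];
        Thread r = new Thread(() -> seen[0] = reader());
        Thread w = new Thread(VolatilePublish::writer);
        r.start();
        w.start();
        try {
            r.join();
            w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(runOnce()); // prints 42
    }
}
```

Without `volatile` on `ready`, the JIT would be allowed to hoist the load out of the loop or reorder the two stores, even though the hardware caches stay coherent throughout.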
In the case of false sharing, the CPUs modify different parts of the same cache line. If one CPU needs to write to the line and the other CPU has just written to it, the first CPU has to invalidate the cache line on the other CPU with an RFO (Request For Ownership) once the write hits the line fill buffer, and it can't continue with the write until this RFO has been acknowledged. But as soon as the other CPU wants to write to that cache line again, it in turn needs to send an RFO and wait for the acknowledgement.
So you get a lot of cache coherence traffic between the CPUs, which are continuously fighting over the same cache line. And if you are out of luck, there are no other instructions the CPU can execute out of order in the meantime, so the CPUs will effectively be mostly idle even though you see 100% CPU utilization.
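A rough way to observe this from Java is sketched below. It assumes 64-byte cache lines, and the class `FalseSharingSketch` and its slot numbers are made up for illustration. Two threads each increment their own slot of an `AtomicLongArray`: slots 0 and 1 are 8 bytes apart and will usually share a cache line, while slots 0 and 16 are 128 bytes apart and cannot share a 64-byte line. This is not a proper benchmark (no warm-up, no JMH), so treat the timings as indicative only.

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class FalseSharingSketch {
    static final int ITERATIONS = 5_000_000;

    // Two threads each hammer their own slot; returns elapsed nanoseconds.
    static long run(AtomicLongArray counters, int slotA, int slotB) {
        Thread a = new Thread(() -> {
            for (int i = 0; i < ITERATIONS; i++) {
                counters.incrementAndGet(slotA); // needs exclusive ownership of the line
            }
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < ITERATIONS; i++) {
                counters.incrementAndGet(slotB);
            }
        });
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Adjacent slots: likely the same cache line (assumes 64-byte lines).
        long adjacent = run(new AtomicLongArray(17), 0, 1);
        // Slots 128 bytes apart: cannot share a 64-byte line.
        long apart = run(new AtomicLongArray(17), 0, 16);
        System.out.println("adjacent slots: " + adjacent / 1_000_000 + " ms");
        System.out.println("distant slots:  " + apart / 1_000_000 + " ms");
    }
}
```

The two threads never touch the same variable, so any slowdown in the adjacent case comes purely from the cache line bouncing between cores.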