To ask the question another way, can you confirm that when you mmap() a file, you do in fact access the exact physical pages that are already in the page cache?
I ask because I’m testing on a 192-core machine with 1TB of RAM, using a 400GB data file that is pre-loaded into the page cache before the test (by dropping the cache, then running md5sum on the file).
Initially, I had all 192 threads each mmap the file separately, on the assumption that they would all get (basically) the same memory region back (or perhaps the same memory region but somehow mapped multiple times). Accordingly, I assumed two threads using two different mappings to the same file would both have direct access to the same pages. (Let’s ignore NUMA for this example, though obviously it’s significant at higher thread counts.)
However, in practice I found that performance became terrible at higher thread counts when each thread mmapped the file separately. When we removed that and instead did a single mmap whose result was passed into each thread (so that all threads directly access the same memory region), performance improved dramatically.
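To make the comparison concrete, here is roughly how the two setups differ. This is a simplified sketch in Python rather than the actual test code: the function names (`run_per_thread_maps`, `run_shared_map`) are hypothetical, the work done per thread is just a byte sum, and it only illustrates the structure of the two variants, not their performance.

```python
import mmap
import os
import threading


def _run_threads(worker, n_threads):
    # Start one thread per index and wait for all of them to finish.
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def run_per_thread_maps(path, n_threads):
    """Variant A: every thread creates its own read-only mapping of the file."""
    size = os.path.getsize(path)
    chunk = (size + n_threads - 1) // n_threads
    totals = [0] * n_threads

    def worker(i):
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                # Slicing copies the bytes out, which touches every page in the range.
                totals[i] = sum(m[i * chunk:(i + 1) * chunk])

    _run_threads(worker, n_threads)
    return sum(totals)


def run_shared_map(path, n_threads):
    """Variant B: the file is mapped once and all threads read through that mapping."""
    size = os.path.getsize(path)
    chunk = (size + n_threads - 1) // n_threads
    totals = [0] * n_threads

    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            def worker(i):
                totals[i] = sum(m[i * chunk:(i + 1) * chunk])

            _run_threads(worker, n_threads)
    return sum(totals)
```

Both variants read the same bytes and return the same result; the only difference is whether there is one mapping per thread or one mapping in total.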
That’s all great, but I’m trying to figure out why. If mmapping a file really just grants direct access to the existing page cache, then I would think it shouldn’t matter how many times you map it: every mapping should resolve to the exact same physical pages.
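My mental model comes from behavior like the following. This is a minimal sketch with a hypothetical helper name (`write_via_one_read_via_other`), using Python's `mmap` module, which creates shared mappings by default. A write through one mapping is visible through a second, independent mapping of the same file, which is what I would expect if both refer to the same underlying pages. It only shows that the mappings are coherent; it says nothing about what each mapping costs to set up or fault in.

```python
import mmap


def write_via_one_read_via_other(path):
    """Map the same file twice, write through one mapping, read through the other."""
    with open(path, "r+b") as f1, open(path, "r+b") as f2:
        with mmap.mmap(f1.fileno(), 0) as a, mmap.mmap(f2.fileno(), 0) as b:
            a[0:5] = b"HELLO"    # store through the first mapping
            return bytes(b[0:5])  # load through the second mapping
```

The file must already exist and be at least 5 bytes long, since a mapping cannot extend past the end of the file.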
But given that there was such a performance cost, it seemed to me that in fact each mmap was being independently and redundantly populated (perhaps by copying from the page cache, or perhaps by reading again from disk).
Can you comment on why I was seeing such different performance between all threads sharing a single mapping and each thread mmapping the same file separately?
Thanks, I appreciate your help!