Does mmap directly access the page cache, or a copy of the page cache?

To ask the question another way: can you confirm that when you mmap() a file, you do in fact access the exact physical pages that are already in the page cache?

I ask because I'm running tests on a 192-core machine with 1 TB of RAM, against a 400 GB data file that is pre-loaded into the page cache before the test (by dropping the cache, then running md5sum on the file).

Initially, I had all 192 threads each mmap the file separately, on the assumption that they would all get (basically) the same memory region back (or perhaps the same memory region but somehow mapped multiple times). Accordingly, I assumed two threads using two different mappings of the same file would both have direct access to the same pages. (Let's ignore NUMA for this example, though obviously it's significant at higher thread counts.)

However, in practice I found that performance got terrible at higher thread counts when each thread mmapped the file separately. When we removed that and instead did a single mmap whose pointer was passed to every thread (so that all threads directly access the same memory region), performance improved dramatically.
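
For concreteness, here's a stripped-down sketch of the two patterns (not my actual test code; the file path is a placeholder and error handling is omitted):

    /* Sketch of the two access patterns. Placeholder path, no error handling.
     * Build with: cc -pthread sketch.c */
    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define NTHREADS 192
    static const char *path = "/data/bigfile";      /* placeholder */

    struct region { unsigned char *base; off_t size; };

    /* Pattern A (slow at high thread counts): each thread mmaps the file itself. */
    static void *worker_own_mapping(void *unused) {
        (void)unused;
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        volatile unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i += 4096)    /* touch every page */
            sum += p[i];
        munmap(p, st.st_size);
        close(fd);
        return NULL;
    }

    /* Pattern B (fast): main() mmaps once, every thread uses the same pointer. */
    static void *worker_shared_mapping(void *arg) {
        struct region *r = arg;
        volatile unsigned long sum = 0;
        for (off_t i = 0; i < r->size; i += 4096)
            sum += r->base[i];
        return NULL;
    }

    int main(void) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        struct region r = { mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0),
                            st.st_size };

        pthread_t tid[NTHREADS];
        /* Swap in worker_own_mapping here to reproduce pattern A. */
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker_shared_mapping, &r);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        munmap(r.base, r.size);
        close(fd);
        return 0;
    }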

That’s all great, but I’m trying to figure out why. If in fact mmapping a file just grants direct access to the existing page cache, then I would think that it shouldn’t matter how many times you map it — it should all go to the exact same place.

But given that there was such a performance cost, it seemed to me that in fact each mmap was being independently and redundantly populated (perhaps by copying from the page cache, or perhaps by reading again from disk).

Can you comment on why I was seeing such different performance between sharing a single mapping across threads and giving each thread its own mapping of the same file?

Thanks, I appreciate your help!

Boot answered 13/9, 2017 at 6:54 Comment(3)
That's a great question. I don't think I'll be able to answer it, but I can offer a few suggestions. 1/ Why not profile it? perf should be able to tell you where the bottleneck is quite easily (I hope). My guess is that you're hitting the (small) mmap overhead, but that it doesn't scale at 192 threads. Also, have you tried using Huge Pages?Bellwort
It's tricky to profile because all the interesting stuff is happening deep inside the kernel. As far as my application knows, it's just accessing RAM -- but between memory mapping, virtual memory, page caches, L3 caches, and NUMA nodes, there are a lot of moving parts to nail down. That said, I agree there is more work to be done to figure this out, but I'm hoping someone with better knowledge of the kernel than me can give some advice on what should happen at least in theory, as that will guide my testing in practice.Boot
Yeah, but usually perf knows where the kernel is spending its time if you have the proper symbols attached. Regarding your question, I have no idea what the source of the issue is. Have you tried to reproduce it on a smaller machine?Bellwort

I think I found my answer, and it comes down to the page tables. The answer is yes: two mmapped regions of the same file access the same underlying page cache data. However, each mapping has to independently map its virtual pages to those physical pages -- meaning twice as many page table entries to reach the same RAM.
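
A quick way to convince yourself that two mappings really do resolve to the same page-cache pages is something like this sketch (scratch file path is a placeholder, error handling omitted): write through one MAP_SHARED mapping and the change is immediately visible through the other, even though the two virtual addresses differ.

    /* Sketch: two independent MAP_SHARED mappings of the same file see each
     * other's writes immediately, because both resolve to the same page-cache
     * pages. Placeholder file name, no error handling. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/tmp/scratch", O_RDWR | O_CREAT, 0600);  /* placeholder */
        ftruncate(fd, 4096);

        char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *b = mmap(NULL, 4096, PROT_READ,              MAP_SHARED, fd, 0);

        strcpy(a, "written through mapping a");
        /* b is a different virtual address, but the same physical page. */
        printf("a=%p b=%p  b reads: \"%s\"\n", (void *)a, (void *)b, b);

        munmap(a, 4096);
        munmap(b, 4096);
        close(fd);
        return 0;
    }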

Basically, each mmap() creates a new range of virtual memory. Every page of that range corresponds to a page of physical memory, and that correspondence is stored in the process's hierarchical page tables -- with one leaf entry per 4 KB page. So every mmap() of a large region generates a huge number of page table entries. (At roughly 8 bytes per entry, a 400 GB mapping needs on the order of 100 million entries -- around 800 MB of page tables -- and that's per mapping.)
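
You can actually watch this happen: /proc/self/status has a VmPTE line that reports how much memory the process's page tables are using. A sketch along these lines (placeholder path, no error handling) should show VmPTE growing as a mapping is faulted in, and growing again by about the same amount for a second mapping of the same file:

    /* Sketch: watch page-table usage (VmPTE in /proc/self/status) grow as a
     * large mapping is faulted in, and grow again for a second mapping of the
     * same file. Placeholder path, no error handling. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void print_vmpte(const char *label) {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        while (fgets(line, sizeof line, f))
            if (strncmp(line, "VmPTE:", 6) == 0)
                printf("%-28s %s", label, line);
        fclose(f);
    }

    static void touch(const unsigned char *p, off_t len) {
        volatile unsigned long sum = 0;
        for (off_t i = 0; i < len; i += 4096)   /* fault in every 4 KB page */
            sum += p[i];
    }

    int main(void) {
        int fd = open("/data/bigfile", O_RDONLY);   /* placeholder path */
        struct stat st;
        fstat(fd, &st);

        print_vmpte("before any mapping:");

        unsigned char *a = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        touch(a, st.st_size);
        print_vmpte("after touching mapping a:");

        unsigned char *b = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        touch(b, st.st_size);
        print_vmpte("after touching mapping b:");   /* roughly doubles */

        munmap(a, st.st_size);
        munmap(b, st.st_size);
        close(fd);
        return 0;
    }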

My guess is that the kernel doesn't actually fill them all in up front, which is why mmap() returns instantly even for a giant file. Instead, it establishes those entries as faults occur on the mmapped range, so the page tables get filled out over time. This extra work to populate a separate set of page tables for each mapping is probably why threads using different mmaps are slower than threads sharing the same mmap. And I'd bet the kernel has to tear all those entries down again when the range is unmapped -- which is why munmap() is so slow.
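
One way to see the lazy fill-in, and to move its cost around, is MAP_POPULATE, which asks the kernel to prefault the mapping at mmap() time. In a sketch like the one below (placeholder path, no error handling), I'd expect the first pass over a plain mapping to be much slower than the second, and the first pass over a MAP_POPULATE mapping to be fast because the page-table entries were built during the mmap() call:

    /* Sketch: compare lazy faulting vs. prefaulting with MAP_POPULATE.
     * Timing the first pass over each mapping should show the cost of
     * filling in the page-table entries. Placeholder path, no error handling. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <time.h>
    #include <unistd.h>

    static double touch_all(const unsigned char *p, off_t len) {
        struct timespec t0, t1;
        volatile unsigned long sum = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (off_t i = 0; i < len; i += 4096)   /* one read per 4 KB page */
            sum += p[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void) {
        int fd = open("/data/bigfile", O_RDONLY);   /* placeholder path */
        struct stat st;
        fstat(fd, &st);

        unsigned char *lazy = mmap(NULL, st.st_size, PROT_READ,
                                   MAP_SHARED, fd, 0);
        printf("lazy mapping, first pass:  %.3f s\n", touch_all(lazy, st.st_size));
        printf("lazy mapping, second pass: %.3f s\n", touch_all(lazy, st.st_size));

        unsigned char *pre = mmap(NULL, st.st_size, PROT_READ,
                                  MAP_SHARED | MAP_POPULATE, fd, 0);
        printf("prefaulted mapping, first pass: %.3f s\n",
               touch_all(pre, st.st_size));

        munmap(lazy, st.st_size);
        munmap(pre, st.st_size);
        close(fd);
        return 0;
    }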

(There's also the translation lookaside buffer, but that's per-CPU, and so small I don't think that matters much here.)

Anyway, it sounds like mapping the same file multiple times just adds extra overhead, for what seems to me like no gain.

Boot answered 17/9, 2017 at 4:20 Comment(0)