In what circumstances can large pages produce a speedup?

Modern x86 CPUs can support page sizes larger than the legacy 4K (i.e. 2MB or 4MB), and there are OS facilities (Linux, Windows) to access this functionality.
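
For reference, one way to access this on Linux is mmap with the MAP_HUGETLB flag. A minimal sketch, assuming huge pages have already been reserved via /proc/sys/vm/nr_hugepages (the 512MB size is just illustrative):

  // Minimal sketch: explicitly request huge pages on Linux with mmap(MAP_HUGETLB).
  // If no reserved huge pages are available the call fails, and a real program
  // would fall back to ordinary 4K pages.
  #include <sys/mman.h>
  #include <cstdio>
  #include <cstdlib>

  int main() {
    const size_t len = 512u << 20;   // 512MB, a multiple of the 2MB huge page size
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
      perror("mmap(MAP_HUGETLB)");   // typically ENOMEM when no huge pages are free
      return EXIT_FAILURE;
    }
    // ... use p as a large, TLB-friendly buffer ...
    munmap(p, len);
    return 0;
  }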

The Microsoft link above states that large pages "increase the efficiency of the translation buffer, which can increase performance for frequently accessed memory". That isn't very helpful for predicting whether large pages will improve any given situation. I'm interested in concrete, preferably quantified, examples of where moving some program logic (or a whole application) to use huge pages has resulted in a performance improvement. Anyone got any success stories?

There's one particular case I know of myself: using huge pages can dramatically reduce the time needed to fork a large process (presumably because the number of page-table entries that need copying is reduced by a factor on the order of 1000). I'm interested in whether huge pages can also be a benefit in less exotic scenarios.

Subtilize answered 20/5, 2010 at 17:42 Comment(0)

I tried to contrive some code which would maximise thrashing of the TLB with 4K pages in order to examine the gains possible from large pages. The code below runs 2.6 times faster (than with 4K pages) when 2MByte pages are provided by libhugetlbfs's malloc (Intel i7, 64-bit Debian Lenny); hopefully it is obvious what scoped_timer and random0n do.

  // Fragment: also needs the author's own scoped_timer (RAII timer) and
  // random0n (RNG functor for random_shuffle) helpers.
  #include <vector>
  #include <algorithm>   // std::random_shuffle
  #include <cstddef>

  volatile char force_result;

  const size_t mb=512;               // buffer size in MB
  const size_t stride=4096;          // touch one byte per 4K page
  std::vector<char> src(mb<<20,0xff);
  std::vector<size_t> idx;
  for (size_t i=0;i<src.size();i+=stride) idx.push_back(i);
  random0n r0n(/*seed=*/23);
  std::random_shuffle(idx.begin(),idx.end(),r0n);

  {
    scoped_timer t
      ("TLB thrash random",mb/static_cast<float>(stride),"MegaAccess");
    char hash=0;
    for (size_t i=0;i<idx.size();++i)
      hash=(hash^src[idx[i]]);       // one dependent load per shuffled page
    force_result=hash;
  }

A simpler "straight line" version with just hash=hash^src[i] only gained 16% from large pages, but (wild speculation) Intel's fancy prefetching hardware may be helping the 4K case when accesses are predictable (I suppose I could disable prefetching to investigate whether that's true).
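
The "straight line" variant presumably looked roughly like the following; only the hash=hash^src[i] expression is from the text above, so the sequential strided loop here is an assumption:

  // Hypothetical reconstruction of the "straight line" variant: same one-byte
  // touch per 4K page, but in address order instead of shuffled order.
  {
    scoped_timer t
      ("TLB thrash sequential",mb/static_cast<float>(stride),"MegaAccess");
    char hash=0;
    for (size_t i=0;i<src.size();i+=stride)
      hash=(hash^src[i]);          // predictable, prefetch-friendly access pattern
    force_result=hash;
  }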

Subtilize answered 21/5, 2010 at 18:43 Comment(1)
Hardware prefetching won't cross 4k page boundaries, but what you are probably seeing in the straight line case is that the page table accesses are very predictable, so the page walk that occurs when you miss in the TLB likely hits pages which are all in L1 (these page entries may indeed have been brought in via prefetching). – Brawn

The biggest difference in performance will come when you are doing widely spaced random accesses to a large region of memory -- where "large" means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).

To make things more complex, the number of TLB entries for 4kB pages is often larger than the number of entries for 2MB pages, but this varies a lot by processor. There is also a lot of variation in how many "large page" entries are available in the Level 2 TLB.

For example, on an AMD Opteron Family 10h Revision D ("Istanbul") system, cpuid reports:

  • L1 DTLB: 4kB pages: 48 entries; 2MB pages: 48 entries; 1GB pages: 48 entries
  • L2 TLB: 4kB pages: 512 entries; 2MB pages: 128 entries; 1GB pages: 16 entries

While on an Intel Xeon 56xx ("Westmere") system, cpuid reports:

  • L1 DTLB: 4kB pages: 64 entries; 2MB pages: 32 entries
  • L2 TLB: 4kB pages: 512 entries; 2MB pages: none

Both can map 2MB (512*4kB) using small pages before suffering level 2 TLB misses, while the Westmere system can map 64MB using its 32 2MB TLB entries and the AMD system can map 352MB using the 176 2MB TLB entries in its L1 and L2 TLBs. Either system will get a significant speedup by using large pages for random accesses over memory ranges that are much larger than 2MB and less than 64MB. The AMD system should continue to show good performance using large pages for much larger memory ranges.
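
The arithmetic here is just TLB reach = number of entries x page size; a trivial sketch using the cpuid figures quoted above:

  // TLB reach = entries * page size, using the cpuid figures quoted above.
  #include <cstdio>

  int main() {
    const long kB = 1024, MB = 1024 * kB;
    printf("L2 TLB reach with 4kB pages: %ld MB\n", 512 * 4 * kB / MB);        // 2 MB on both systems
    printf("Westmere 2MB-page reach:     %ld MB\n", 32 * 2 * MB / MB);         // 64 MB
    printf("Istanbul 2MB-page reach:     %ld MB\n", (48 + 128) * 2 * MB / MB); // 352 MB
    return 0;
  }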

What you are trying to avoid in all these cases is the worst case (note 1) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (note 2) work, it requires:

  • 5 trips to memory to load data mapped on a 4kB page,
  • 4 trips to memory to load data mapped on a 2MB page, and
  • 3 trips to memory to load data mapped on a 1GB page.

In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information. The best description I have seen is in Section 5.3 of AMD's "AMD64 Architecture Programmer’s Manual Volume 2: System Programming" (publication 24593): http://support.amd.com/us/Embedded_TechDocs/24593.pdf
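
To make the trip counts concrete, here is a sketch of how a 48-bit x86_64 virtual address is split into the four table indices during translation; this is standard 4-level paging, not anything specific to the AMD manual:

  // Sketch: splitting a 48-bit x86_64 virtual address for 4-level translation.
  // 4kB pages use all four indices; 2MB pages stop at the PD level (21-bit offset);
  // 1GB pages stop at the PDPT level (30-bit offset) -- hence one fewer trip each.
  #include <cstdint>
  #include <cstdio>

  int main() {
    uint64_t va = 0x00007f1234567abcULL;   // arbitrary example address
    unsigned pml4 = (va >> 39) & 0x1ff;    // level 4 index (9 bits)
    unsigned pdpt = (va >> 30) & 0x1ff;    // level 3 index; a 1GB page ends here
    unsigned pd   = (va >> 21) & 0x1ff;    // level 2 index; a 2MB page ends here
    unsigned pt   = (va >> 12) & 0x1ff;    // level 1 index; only needed for 4kB pages
    unsigned off  = va & 0xfff;            // byte offset within a 4kB page
    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n", pml4, pdpt, pd, pt, off);
    return 0;
  }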

Note 1: The figures above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.

Note 2: Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.

Those answered 12/3, 2012 at 21:20 Comment(0)

I've seen improvement in some HPC/Grid scenarios - specifically physics packages with very, very large models on machines with lots and lots of RAM, where the process running the model was the only thing active on the machine. I suspect, though I have not measured, that certain DB functions (e.g. bulk imports) would benefit as well.

Personally, I think that unless you have a very well profiled/understood memory access pattern that does a lot of large, scattered memory accesses, it is unlikely that you will see any significant improvement.

Sieve answered 20/5, 2010 at 20:49 Comment(0)

This is getting esoteric, but huge TLB pages make a significant difference on the Intel Xeon Phi (MIC) architecture when doing DMA memory transfers (from host to Phi via PCIe). This Intel link describes how to enable huge pages. I found that with the normal 4K page size, DMA performance started to decrease once transfer sizes exceeded 8 MB, falling from about 3 GB/s to under 1 GB/s by the time the transfer size reached 512 MB.

After enabling huge TLB pages (2MB), the data rate continued to increase to over 5 GB/s for DMA transfers of 512 MB.
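
For what it's worth, on the Linux side a large buffer can also be hinted toward 2MB transparent huge pages with madvise(MADV_HUGEPAGE). The sketch below is generic Linux, not the Intel/Phi-specific procedure from the link above, and assumes transparent huge pages are enabled on the system:

  // Generic Linux sketch (not the Intel/Phi-specific steps): hint that a large,
  // 2MB-aligned buffer should be backed by transparent huge pages.
  #include <sys/mman.h>
  #include <cstdio>
  #include <cstdlib>

  int main() {
    const size_t len = 512u << 20;                   // 512MB buffer, as in the transfers above
    void* buf = nullptr;
    if (posix_memalign(&buf, 2u << 20, len) != 0) {  // align to the 2MB huge page size
      std::fprintf(stderr, "posix_memalign failed\n");
      return EXIT_FAILURE;
    }
    if (madvise(buf, len, MADV_HUGEPAGE) != 0) {     // best-effort hint; needs THP support
      perror("madvise(MADV_HUGEPAGE)");
    }
    // ... fill buf and hand it to the DMA transfer API ...
    free(buf);
    return 0;
  }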

Phelloderm answered 15/10, 2014 at 23:46 Comment(0)

I get a ~5% speedup on servers with a lot of memory (>=64GB) running big processes, e.g. for a 16GB Java process that's 4M x 4kB pages but only 4k x 4MB pages.

Bolivia answered 21/5, 2010 at 7:23 Comment(0)
