Using 1GB pages degrades performance

Asked 2/9, 2020 at 12:10 Answered 26/10, 2020 at 21:27

I have an application that needs about 850 MB of contiguous memory and accesses it in a random manner. It was suggested that I allocate one huge page of 1 GB, so that it would always be in the TLB. I've written a demo with sequential/random accesses to compare the performance of small (4 KB in my case) and large (1 GB) pages:

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT) // Not used in this example.
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#define MESSINESS_LEVEL 512 // Poisons caches if LRU policy is used.

#define RUN_TESTS 25

void print_usage() {
  printf("Usage: ./program small|huge1gb sequential|random\n");
}

int main(int argc, char *argv[]) {
  if (argc != 3) { // exactly two arguments: page size and access pattern
    print_usage();
    return -1;
  }
  uint64_t size = 1UL * 1024 * 1024 * 1024; // 1GB
  uint32_t *ptr;
  if (strcmp(argv[1], "small") == 0) {
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, // basically malloc(size);
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED) {
      perror("mmap small");
      exit(1);
    }
  } else if (strcmp(argv[1], "huge1gb") == 0) {
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
    if (ptr == MAP_FAILED) {
      perror("mmap huge1gb");
      exit(1);
    }
  } else {
    print_usage();
    return -1;
  }

  clock_t start_time, end_time;
  start_time = clock();

  if (strcmp(argv[2], "sequential") == 0) {
    for (int iter = 0; iter < RUN_TESTS; iter++) {
      for (uint64_t i = 0; i < size / sizeof(*ptr); i++)
        ptr[i] = i * 5;
    }
  } else if (strcmp(argv[2], "random") == 0) {
    // pseudorandom access pattern, defeats caches.
    uint64_t index;
    for (int iter = 0; iter < RUN_TESTS; iter++) {
      for (uint64_t i = 0; i < size / MESSINESS_LEVEL / sizeof(*ptr); i++) {
        for (uint64_t j = 0; j < MESSINESS_LEVEL; j++) {
          index = i + j * size / MESSINESS_LEVEL / sizeof(*ptr);
          ptr[index] = index * 5;
        }
      }
    }
  } else {
    print_usage();
    return -1;
  }

  end_time = clock();
  long double duration = (long double)(end_time - start_time) / CLOCKS_PER_SEC;
  printf("Avr. Duration per test: %Lf\n", duration / RUN_TESTS);
  //  write(1, ptr, size); // Dumps memory content (1GB to stdout).
}

And on my machine (more below) the results are:

Sequential:

$ ./test small sequential
Avr. Duration per test: 0.562386
$ ./test huge1gb sequential        <--- slightly better
Avr. Duration per test: 0.543532

Random:

$ ./test small random              <--- better
Avr. Duration per test: 2.911480
$ ./test huge1gb random
Avr. Duration per test: 6.461034

I'm bothered by the random test: it seems that a 1GB page is 2 times slower! I tried using madvise with MADV_SEQUENTIAL / MADV_RANDOM for the respective tests, but it didn't help.

Why does using one huge page degrade performance in the case of random accesses? What are the use cases for huge pages (2MB and 1GB) in general?

I didn't test this code with 2MB pages; I think they would probably do better. I also suspect that, since a 1GB page is stored in one memory bank, the slowdown has something to do with multi-channel memory. But I would like to hear from you folks. Thanks.

Note: to run the test you must first enable 1GB pages in your kernel. You can do so by passing the kernel these parameters: hugepagesz=1G hugepages=1 default_hugepagesz=1G. More: https://wiki.archlinux.org/index.php/Kernel_parameters. If enabled, you should get something like:

$ cat /proc/meminfo | grep Huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:         1048576 kB
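
To check from a program whether a 1GB page is actually free before calling mmap, the same fields can be parsed out of /proc/meminfo. This is only a minimal sketch that assumes the "Key:   value" line format shown above; parse_meminfo_field is a hypothetical helper name, not part of any library:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan /proc/meminfo-style text for a line that starts
 * with "<key>:" and return its numeric value, or -1 if the key is absent. */
long parse_meminfo_field(const char *text, const char *key) {
  size_t key_len = strlen(key);
  const char *line = text;
  while (line != NULL && *line != '\0') {
    if (strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
      long value;
      if (sscanf(line + key_len + 1, "%ld", &value) == 1)
        return value;
      return -1; /* key found, but no number after the colon */
    }
    line = strchr(line, '\n');
    if (line != NULL)
      line++; /* step past the newline to the start of the next line */
  }
  return -1;
}
```

One way to use it would be to read /proc/meminfo into a buffer with fread and refuse to start the huge1gb test unless parse_meminfo_field(buffer, "HugePages_Free") is at least 1.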

EDIT1: My machine has a Core i5 8600 and 4 memory banks of 4 GB each. The CPU natively supports both 2MB and 1GB pages (it has the pse & pdpe1gb flags, see: https://wiki.debian.org/Hugepages#x86_64). I was measuring machine time, not CPU time; I have updated the code, and the results are now the average of 25 tests.
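
On the machine time vs CPU time point: clock(), as used in the listing above, reports processor time consumed by the program. If wall-clock time is wanted instead, a rough sketch using POSIX clock_gettime with CLOCK_MONOTONIC could look like this (the helper names timespec_diff_seconds and wall_now are made up for illustration):

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Difference between two timespec readings, in seconds. */
double timespec_diff_seconds(struct timespec start, struct timespec end) {
  return (double)(end.tv_sec - start.tv_sec) +
         (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Monotonic wall-clock reading; not affected by system clock adjustments. */
struct timespec wall_now(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts;
}
```

Taking one wall_now() reading before the loops and one after, and dividing the difference by RUN_TESTS, would give the per-test wall-clock duration. For a single-threaded test that never blocks, the two clocks should give similar numbers.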

I was also told that this test does better on 2MB pages than normal 4KB ones.

Jellied answered 2/9, 2020 at 12:10 Comment(8)

Can you provide some details about your machine. – Naarah 2/9, 2020 at 12:48

You are out of context. Contiguous virtual address space is not contiguous in physical address space. If you think allocating a single bulk of memory will reduce page faults and thus improve performance, then in systems, usually, results are counter intuitive. – Mcmurray 2/9, 2020 at 13:0

@TonyTannous Huge pages - if supported - are contiguous in physical memory – Naarah 2/9, 2020 at 13:21

See: https://mcmap.net/q/16591/-are-some-allocators-lazy – Earing 2/9, 2020 at 14:0

Shouldn't you be using MAP_POPULATE and MAP_LOCKED as well, unless you wanted to test the faulting performance specifically? Anyway, you should be able to use perf to see TLB, cache and other hardware counters. – Haemolysis 2/9, 2020 at 14:12

@TonyTannous as far as I know, one virtual page, if we are talking about memory mapping as in my case (but it could also be file mapping/devices/etc), corresponds to one physical page with exact size OR a continuous chunk of memory with that size. x86_64 ISA supports 2MB and 1GB pages: wiki.debian.org/Hugepages#x86_64. – Jellied 2/9, 2020 at 14:16

You are measuring wall clock time rather than CPU time. This may or may not be a good idea. You are also running just one test. This is certainly not a good idea. Run a loop of 100 and take an average. FWIW, I tried your test with 2MB pages and it is faster than with normal pages. – Kindliness 2/9, 2020 at 14:37

I confirm your observations, 1GB page random access is twice slower than 4kB pages on Skylake. Quite peculiar. – Auntie 2/9, 2020 at 15:56

Intel was kind enough to reply to this issue. See their answer below.

This issue is due to how physical pages are actually committed. In case of 1GB pages, the memory is contiguous. So, as soon as you write to any one byte within the 1GB page, the entire 1GB page is assigned. However, with 4KB pages, the physical pages get allocated one at a time, as each 4KB page is touched for the first time.

for (uint64_t i = 0; i < size / MESSINESS_LEVEL / sizeof(*ptr); i++) {
   for (uint64_t j = 0; j < MESSINESS_LEVEL; j++) {
       index = i + j * size / MESSINESS_LEVEL / sizeof(*ptr);
           ptr[index] = index * 5;
   }
}

In the innermost loop, the index changes at a stride of 512K elements, which is 2 MiB in bytes because each element is 4 bytes wide. So, consecutive references map at constant power-of-two offsets. Typically caches have 2048 sets (which is 2^11). So, bits 6:16 select the sets. But if you stride at such offsets, bits 6:16 stay the same, so every access ends up selecting the same set and spatial locality is lost.
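
The set-conflict argument can be checked with a little arithmetic. The sketch below assumes 64-byte cache lines and the 2048-set figure quoted above (real caches differ per level and may hash addresses, so treat it as an illustration only); cache_set_index and benchmark_byte_offset are hypothetical helpers that mirror the index computation of the benchmark:

```c
#include <stdint.h>

#define CACHE_LINE_BITS 6 /* assumed 64-byte lines: bits 0:5 are the line offset */
#define CACHE_SET_BITS 11 /* assumed 2048 sets, as in the explanation above */

/* Set index taken from address bits 6:16. */
uint64_t cache_set_index(uint64_t addr) {
  return (addr >> CACHE_LINE_BITS) & ((1ULL << CACHE_SET_BITS) - 1);
}

/* Byte offset of the element written for a given (i, j) in the "random" loop. */
uint64_t benchmark_byte_offset(uint64_t i, uint64_t j) {
  const uint64_t size = 1ULL << 30;       /* 1 GiB buffer */
  const uint64_t messiness = 512;         /* MESSINESS_LEVEL */
  const uint64_t elem = sizeof(uint32_t); /* sizeof(*ptr) */
  uint64_t index = i + j * size / messiness / elem;
  return index * elem;
}
```

By this arithmetic, consecutive inner-loop accesses are 2 MiB apart in bytes (512K four-byte elements), and any power-of-two stride of 128 KiB or more leaves bits 6:16 untouched, so cache_set_index returns the same value for every j at a fixed i. Inside a 1GB page the low 30 bits of the virtual and physical address are identical, so the conflict carries over to physical addresses. With 4KB pages only the low 12 bits are shared, and bits 12:16 depend on which physical frames the kernel happened to hand out.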

We would recommend initializing the entire 1GB buffer sequentially (in the small page test), as below, before starting the clock to time it:

for (uint64_t i = 0; i < size / sizeof(*ptr); i++)
    ptr[i] = i * 5;

Basically, the issue is with set conflicts resulting in cache misses in case of huge pages compared to small pages due to very large constant offsets. When you use constant offsets, the test is really not random.
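
For comparison, an access pattern without constant offsets could draw each index from a small pseudorandom generator. This is a sketch, not part of the original benchmark: xorshift64 is Marsaglia's generator with the 13/7/17 shift triple, and random_writes, its parameters and the fallback seed are made up for illustration:

```c
#include <stdint.h>

/* One xorshift64 step; the state must be non-zero. */
uint64_t xorshift64(uint64_t *state) {
  uint64_t x = *state;
  x ^= x << 13;
  x ^= x >> 7;
  x ^= x << 17;
  *state = x;
  return x;
}

/* Writes `count` pseudorandomly chosen elements of an n-element buffer. */
void random_writes(uint32_t *ptr, uint64_t n, uint64_t count, uint64_t seed) {
  uint64_t state = seed ? seed : 88172645463325252ULL; /* avoid the zero state */
  for (uint64_t k = 0; k < count; k++) {
    uint64_t index = xorshift64(&state) % n;
    ptr[index] = (uint32_t)(index * 5);
  }
}
```

Calling random_writes(ptr, size / sizeof(*ptr), size / sizeof(*ptr), 1) would perform as many writes per pass as the original loop, although some elements will be hit more than once and others not at all. The generator adds a few ALU operations per access, so timings are not directly comparable with the strided version.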

Auntie answered 26/10, 2020 at 21:27 Comment(0)

Not an answer, but to provide more details to this perplexing issue.

Performance counters show roughly similar number of instructions, but roughly twice the number of cycles spent when huge pages are used:

4KiB pages IPC 0.29,
1GiB pages IPC 0.10.

These IPC numbers say that the code is bottlenecked on memory access (CPU bound IPC on Skylake is 3 and above). Huge pages bottleneck harder.
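
For reference, the IPC figures are just the instructions counter divided by the cycles counter from the perf stat output in this answer. A trivial sketch (ipc is a hypothetical helper name):

```c
#include <stdint.h>

/* Instructions per cycle from two raw perf counter values. */
double ipc(uint64_t instructions, uint64_t cycles) {
  return cycles ? (double)instructions / (double)cycles : 0.0;
}
```

With the Skylake counters listed in this answer, ipc(3268573978, 11448252551) comes out a little under 0.29 for 4KiB pages and ipc(2175972910, 21911541450) a little under 0.10 for 1GiB pages.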

I modified your benchmark to use MAP_POPULATE | MAP_LOCKED | MAP_FIXED with fixed address 0x600000000000 for both cases to eliminate time variation associated with page faults and random mapping address. On my Skylake system 2MiB and 1GiB pages are more than 2x slower than 4KiB pages.
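
A rough sketch of what such a pre-faulted mapping might look like, not the exact code used for the numbers below. The names benchmark_mmap_flags and map_benchmark_buffer are hypothetical, MAP_FIXED is left out on purpose because it silently replaces any mapping already at the requested address, and the fallback MAP_HUGE_SHIFT value of 26 is an assumption for headers that do not define it:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26 /* assumed fallback for older headers */
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

/* Flags for a pre-faulted, locked anonymous mapping; adds the 1 GiB huge page
 * request when use_huge is non-zero. */
int benchmark_mmap_flags(int use_huge) {
  int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_LOCKED;
  if (use_huge)
    flags |= MAP_HUGETLB | MAP_HUGE_1GB;
  return flags;
}

/* Maps `size` bytes; returns MAP_FAILED on error, like mmap itself. */
void *map_benchmark_buffer(size_t size, int use_huge) {
  return mmap(NULL, size, PROT_READ | PROT_WRITE,
              benchmark_mmap_flags(use_huge), -1, 0);
}
```

Note that MAP_LOCKED can fail for an unprivileged process if the RLIMIT_MEMLOCK limit is lower than the mapping size, so the result still needs to be compared against MAP_FAILED.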

Compiled with g++-8.4.0 -std=gnu++14 -pthread -m{arch,tune}=skylake -O3 -DNDEBUG:

[max@supernova:~/src/test] $ sudo hugeadm --pool-pages-min 2MB:64 --pool-pages-max 2MB:64
[max@supernova:~/src/test] $ sudo hugeadm --pool-pages-min 1GB:1 --pool-pages-max 1GB:1
[max@supernova:~/src/test] $ for s in small huge; do sudo chrt -f 40 taskset -c 7 perf stat -dd ./release/gcc/test $s random; done
Duration: 2156150

 Performance counter stats for './release/gcc/test small random':

       2291.190394      task-clock (msec)         #    1.000 CPUs utilized          
                 1      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                53      page-faults               #    0.023 K/sec                  
    11,448,252,551      cycles                    #    4.997 GHz                      (30.83%)
     3,268,573,978      instructions              #    0.29  insn per cycle           (38.55%)
       430,248,155      branches                  #  187.784 M/sec                    (38.55%)
           758,917      branch-misses             #    0.18% of all branches          (38.55%)
       224,593,751      L1-dcache-loads           #   98.025 M/sec                    (38.55%)
       561,979,341      L1-dcache-load-misses     #  250.22% of all L1-dcache hits    (38.44%)
       271,067,656      LLC-loads                 #  118.309 M/sec                    (30.73%)
           668,118      LLC-load-misses           #    0.25% of all LL-cache hits     (30.73%)
   <not supported>      L1-icache-loads                                             
           220,251      L1-icache-load-misses                                         (30.73%)
       286,864,314      dTLB-loads                #  125.203 M/sec                    (30.73%)
             6,314      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.73%)
                29      iTLB-loads                #    0.013 K/sec                    (30.73%)
             6,366      iTLB-load-misses          # 21951.72% of all iTLB cache hits  (30.73%)

       2.291300162 seconds time elapsed

Duration: 4349681

 Performance counter stats for './release/gcc/test huge random':

       4385.282466      task-clock (msec)         #    1.000 CPUs utilized          
                 1      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                53      page-faults               #    0.012 K/sec                  
    21,911,541,450      cycles                    #    4.997 GHz                      (30.70%)
     2,175,972,910      instructions              #    0.10  insn per cycle           (38.45%)
       274,356,392      branches                  #   62.563 M/sec                    (38.54%)
           560,941      branch-misses             #    0.20% of all branches          (38.63%)
         7,966,853      L1-dcache-loads           #    1.817 M/sec                    (38.70%)
       292,131,592      L1-dcache-load-misses     # 3666.84% of all L1-dcache hits    (38.65%)
            27,531      LLC-loads                 #    0.006 M/sec                    (30.81%)
            12,413      LLC-load-misses           #   45.09% of all LL-cache hits     (30.72%)
   <not supported>      L1-icache-loads                                             
           353,438      L1-icache-load-misses                                         (30.65%)
         7,252,590      dTLB-loads                #    1.654 M/sec                    (30.65%)
               440      dTLB-load-misses          #    0.01% of all dTLB cache hits   (30.65%)
               274      iTLB-loads                #    0.062 K/sec                    (30.65%)
             9,577      iTLB-load-misses          # 3495.26% of all iTLB cache hits   (30.65%)

       4.385392278 seconds time elapsed

Ran on Ubuntu 18.04.5 LTS with Intel i9-9900KS (which is not NUMA), 4x8GiB 4GHz CL17 RAM in all 4 slots, with performance governor for no CPU frequency scaling, liquid cooling fans on max for no thermal throttling, FIFO 40 priority for no preemption, on one specific CPU core for no CPU migration, multiple runs. The results are similar with clang++-8.0.0 compiler.

It feels like something is fishy in hardware, like a store buffer per page frame, so that 4KiB pages allow for ~2x more stores per unit of time.

Would be interesting to see results for AMD Zen 3 CPUs.

On AMD Ryzen 9 5950X the huge pages version is only up to 10% slower:

Duration: 1578723

 Performance counter stats for './release/gcc/test small random':

          1,726.89 msec task-clock                #    1.000 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
             1,947      page-faults               #    0.001 M/sec                  
     8,189,576,204      cycles                    #    4.742 GHz                      (33.02%)
         3,174,036      stalled-cycles-frontend   #    0.04% frontend cycles idle     (33.14%)
            95,950      stalled-cycles-backend    #    0.00% backend cycles idle      (33.25%)
     3,301,760,473      instructions              #    0.40  insn per cycle         
                                                  #    0.00  stalled cycles per insn  (33.37%)
       480,276,481      branches                  #  278.116 M/sec                    (33.49%)
           864,075      branch-misses             #    0.18% of all branches          (33.59%)
       709,483,403      L1-dcache-loads           #  410.844 M/sec                    (33.59%)
     1,608,181,551      L1-dcache-load-misses     #  226.67% of all L1-dcache accesses  (33.59%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             
        78,963,441      L1-icache-loads           #   45.726 M/sec                    (33.59%)
            46,639      L1-icache-load-misses     #    0.06% of all L1-icache accesses  (33.51%)
       301,463,437      dTLB-loads                #  174.570 M/sec                    (33.39%)
       301,698,272      dTLB-load-misses          #  100.08% of all dTLB cache accesses  (33.28%)
                54      iTLB-loads                #    0.031 K/sec                    (33.16%)
             2,774      iTLB-load-misses          # 5137.04% of all iTLB cache accesses  (33.05%)
       243,732,886      L1-dcache-prefetches      #  141.140 M/sec                    (33.01%)
   <not supported>      L1-dcache-prefetch-misses                                   

       1.727052901 seconds time elapsed

       1.579089000 seconds user
       0.147914000 seconds sys

Duration: 1628512

 Performance counter stats for './release/gcc/test huge random':

          1,680.06 msec task-clock                #    1.000 CPUs utilized          
                 1      context-switches          #    0.001 K/sec                  
                 1      cpu-migrations            #    0.001 K/sec                  
             1,947      page-faults               #    0.001 M/sec                  
     8,037,708,678      cycles                    #    4.784 GHz                      (33.34%)
         4,684,831      stalled-cycles-frontend   #    0.06% frontend cycles idle     (33.34%)
         2,445,415      stalled-cycles-backend    #    0.03% backend cycles idle      (33.34%)
     2,217,699,442      instructions              #    0.28  insn per cycle         
                                                  #    0.00  stalled cycles per insn  (33.34%)
       281,522,918      branches                  #  167.567 M/sec                    (33.34%)
           549,427      branch-misses             #    0.20% of all branches          (33.33%)
       312,930,677      L1-dcache-loads           #  186.261 M/sec                    (33.33%)
     1,614,505,314      L1-dcache-load-misses     #  515.93% of all L1-dcache accesses  (33.33%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             
           888,872      L1-icache-loads           #    0.529 M/sec                    (33.33%)
            13,140      L1-icache-load-misses     #    1.48% of all L1-icache accesses  (33.33%)
             9,168      dTLB-loads                #    0.005 M/sec                    (33.33%)
               870      dTLB-load-misses          #    9.49% of all dTLB cache accesses  (33.33%)
             1,173      iTLB-loads                #    0.698 K/sec                    (33.33%)
             1,914      iTLB-load-misses          #  163.17% of all iTLB cache accesses  (33.33%)
       253,307,275      L1-dcache-prefetches      #  150.772 M/sec                    (33.33%)
   <not supported>      L1-dcache-prefetch-misses                                   

       1.680230802 seconds time elapsed

       1.628170000 seconds user
       0.052005000 seconds sys

Auntie answered 4/9, 2020 at 17:3 Comment(10)

The huge test does have significantly more iTLB loads and misses along with more icache load misses. That seems strange. – Hair 4/9, 2020 at 17:17

@AndrewHenle Things are strange in these outputs indeed. L1-dcache-loads 6,758,085, but L1-dcache-load-misses 293,418,903, how to interpret that? Shouldn't L1-dcache-loads >= L1-dcache-load-misses? Or should it be L1-dcache-loads / (L1-dcache-loads + L1-dcache-load-misses)? perf doesn't think so with L1-dcache-load-misses/L1-dcache-loads == 4341.75%. – Auntie 4/9, 2020 at 17:26

I have no idea what's going on there. Skylakes are NUMA, aren't they? Maybe memory locality is causing the lower performance? – Hair 4/9, 2020 at 17:29

@AndrewHenle I run this on i9-9900KS, no NUMA. But you are probably right that miss rate is miss/(hit+miss) and perf doesn't display that right. – Auntie 4/9, 2020 at 17:30

I'm guessing it's safe to assume your system wasn't doing much else during the test, soooo? I'm stumped. This is weird. I've done a lot of huge pages work on older Sun hardware like the E25Ks, and I've never seen larger pages make things slower like this (just never let an Oracle database and its need for large pages fight with the ZFS ARC and its demand for small pages for the last bits of free RAM...) – Hair 4/9, 2020 at 17:39

@AndrewHenle I use huge pages in production and they were benchmarked and showed better timings on production workloads on Xeons. But this simple benchmark shows something fundamentally misunderstood or broken with huge pages, on Skylake at least. And I do due diligence when benchmarking, like booting kernel in level 3 or s, setting performance governor, CPU fans to max, multiple runs with FIFO real-time priority. – Auntie 4/9, 2020 at 17:41

The 1GiB page run shows all loads and misses are fewer than those of 4KiB page run. Only iTLB-loads + iTLB-load-misses for 1GiB pages are larger than those for 4KiB pages. MAP_HUGETLB is supposed to minimize dTLB-load-misses, which it does. But surprising is that the huge page version does roughly 30% more iTLB loads. Could be the less frequent path of mmap with MAP_HUGETLB, but that probably doesn't explain more than 2x IPC difference. – Auntie 4/9, 2020 at 17:57

I completely agree with that. I wonder what the actual instruction timing is? I did find this: "Why Skylake CPUs Are Sometimes 50% Slower – How Intel Has Broken Existing Code". Now I wish I had some new hardware to experiment with even if I don't have your experience with this kind of profiling on Intel hardware. All I have access to right now is pretty ancient. – Hair 4/9, 2020 at 18:5

@AndrewHenle Thank you, but my profiling experience is 99% looking at each and every number and applying common sense. The most primitive and widely supported CPU cycles counter can get you very far, no need for the latest CPUs with fancy counters. perf record -e cycles:uppp -c 10000 <app> followed by perf report -Mintel shows where CPU cycles are spent. If a load/store from/to memory shows up burning many cycles that means it bottlenecks on memory access (which is the case 99% of time) - no rocket science - only one basic CPU cycle counter is required to gain good insight. – Auntie 4/9, 2020 at 20:21

First, I think L1-dcache-loads are L1-dcache hits. Also, it's interesting how the small test produces many more memory accesses (L1-dcache-loads + L1-dcache-misses). It's probably because of the page table walker running much more frequently and making accesses while searching the tables. – Viewer 18/6, 2021 at 19:54
