I have an application where I need about 850 MB of continuous memory and be accessing it in a random manner. I was suggested to allocate a huge page of 1 GB, so that it would always be in TLB. I've written a demo with sequential/random accesses to measure the performance for small (4 KB in my case) vs large (1 GB) page:
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT) // Aren't used in this example.
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#define MESSINESS_LEVEL 512 // Poisons caches if LRU policy is used.
#define RUN_TESTS 25
void print_usage() {
printf("Usage: ./program small|huge1gb sequential|random\n");
}
int main(int argc, char *argv[]) {
if (argc != 3 && argc != 4) {
print_usage();
return -1;
}
uint64_t size = 1UL * 1024 * 1024 * 1024; // 1GB
uint32_t *ptr;
if (strcmp(argv[1], "small") == 0) {
ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, // basically malloc(size);
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (ptr == MAP_FAILED) {
perror("mmap small");
exit(1);
}
} else if (strcmp(argv[1], "huge1gb") == 0) {
ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
if (ptr == MAP_FAILED) {
perror("mmap huge1gb");
exit(1);
}
} else {
print_usage();
return -1;
}
clock_t start_time, end_time;
start_time = clock();
if (strcmp(argv[2], "sequential") == 0) {
for (int iter = 0; iter < RUN_TESTS; iter++) {
for (uint64_t i = 0; i < size / sizeof(*ptr); i++)
ptr[i] = i * 5;
}
} else if (strcmp(argv[2], "random") == 0) {
// pseudorandom access pattern, defeats caches.
uint64_t index;
for (int iter = 0; iter < RUN_TESTS; iter++) {
for (uint64_t i = 0; i < size / MESSINESS_LEVEL / sizeof(*ptr); i++) {
for (uint64_t j = 0; j < MESSINESS_LEVEL; j++) {
index = i + j * size / MESSINESS_LEVEL / sizeof(*ptr);
ptr[index] = index * 5;
}
}
}
} else {
print_usage();
return -1;
}
end_time = clock();
long double duration = (long double)(end_time - start_time) / CLOCKS_PER_SEC;
printf("Avr. Duration per test: %Lf\n", duration / RUN_TESTS);
// write(1, ptr, size); // Dumps memory content (1GB to stdout).
}
And on my machine (more below) the results are:
Sequential:
$ ./test small sequential
Avr. Duration per test: 0.562386
$ ./test huge1gb sequential <--- slightly better
Avr. Duration per test: 0.543532
Random:
$ ./test small random <--- better
Avr. Duration per test: 2.911480
$ ./test huge1gb random
Avr. Duration per test: 6.461034
I'm bothered with the random test, it seems that a 1GB page is 2 times slower!
I tried using madvise
with MADV_SEQUENTIAL
/ MADV_SEQUENTIAL
for respective tests, it didn't help.
Why does using a one huge page in case of random accesses degrades performance? What are the use-cases for huge pages (2MB and 1GB) in general?
I didn't test this code with 2MB pages, I think it should probably do better. I also suspect that since a 1GB page is stored in one memory bank it probably has something to do with multi-channels. But I would like to hear from you folks. Thanks.
Note: to run the test you must first enable 1GB pages in your kernel. You can do it by giving kernel this parameters hugepagesz=1G hugepages=1 default_hugepagesz=1G
. More: https://wiki.archlinux.org/index.php/Kernel_parameters. If enabled, you should get something like:
$ cat /proc/meminfo | grep Huge
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 1
HugePages_Free: 1
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 1048576 kB
EDIT1: My machine has Core i5 8600 and 4 memory banks 4 GB each. The CPU natively supports both 2MB and 1GB pages (it has pse
& pdpe1gb
flags, see: https://wiki.debian.org/Hugepages#x86_64). I was measuring machine time, not CPU time, I updated the code and the results now are average of 25 tests.
I was also told that this test does better on 2MB pages than normal 4KB ones.
MAP_POPULATE
andMAP_LOCKED
as well, unless you wanted to test the faulting performance specifically? Anyway, you should be able to useperf
to see TLB, cache and other hardware counters. – Haemolysis