Using move_pages() to move hugepages?

This question is for:

  1. kernel 3.10.0-1062.4.3.el7.x86_64
  2. non transparent hugepages allocated via boot parameters and might or might not be mapped to a file (e.g. mounted hugepages)
  3. x86_64

According to this kernel source, move_pages() calls do_pages_move() to move a page, but I don't see how it would, even indirectly, end up calling migrate_huge_page().

So my questions are:

  1. Can move_pages() move hugepages? If yes, should the page boundary be 4KB or 2MB when passing the array of page addresses? It seems there was a patch for supporting hugepage migration 5 years ago.
  2. If move_pages() cannot move hugepages, how can I move them?
  3. After moving hugepages, can I query the NUMA IDs of the hugepages the same way I query regular pages, as in this answer?

In the code below I move hugepages via move_pages() with page size = 2MB, but is that the correct way?:

#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <algorithm>
#include <iostream>
#include <limits>
#include <numaif.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

int main(int argc, char** argv) {
        if (argc < 2) {
                std::cerr << "usage: " << argv[0] << " <dst_numa_node>" << std::endl;
                return 1;
        }
        const int32_t dst_node = strtoul(argv[1], nullptr, 10);
        constexpr uint64_t size = 4lu * 1024 * 1024;
        constexpr uint64_t pageSize = 2lu * 1024 * 1024;        // 2MB hugepage
        constexpr uint32_t nPages = size / pageSize;
        int32_t status[nPages];
        std::fill_n(status, nPages, std::numeric_limits<int32_t>::min());
        void* pages[nPages];
        int32_t dst_nodes[nPages];

        // map from the preallocated hugepage pool
        void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
        if (ptr == MAP_FAILED) {
                throw "failed to map hugepages";
        }
        memset(ptr, 0x41, nPages*pageSize);     // touch the memory so the pages are faulted in
        for (uint32_t i = 0; i < nPages; i++) {
                pages[i] = &((char*)ptr)[i*pageSize];   // 2MB stride
                dst_nodes[i] = dst_node;
        }

        std::cout << "Before moving" << std::endl;

        // nodes == nullptr: query-only mode, the current node of each page is written to status[]
        if (0 != move_pages(0, nPages, pages, nullptr, status, 0)) {
                std::cout << "failed to query pages because " << strerror(errno) << std::endl;
        }
        else {
                for (uint32_t i = 0; i < nPages; i++) {
                        std::cout << "page # " << i << " is on numa node " << status[i] << std::endl;
                }
        }

        // real move (MPOL_MF_MOVE_ALL requires CAP_SYS_NICE)
        if (0 != move_pages(0, nPages, pages, dst_nodes, status, MPOL_MF_MOVE_ALL)) {
                std::cout << "failed to move pages because " << strerror(errno) << std::endl;
                exit(-1);
        }

        // query again, this time at 4KB granularity
        constexpr uint64_t smallPageSize = 4lu * 1024;
        constexpr uint32_t nSmallPages = size / smallPageSize;
        void* smallPages[nSmallPages];
        int32_t smallStatus[nSmallPages];
        std::fill_n(smallStatus, nSmallPages, std::numeric_limits<int32_t>::min());
        for (uint32_t i = 0; i < nSmallPages; i++) {
                smallPages[i] = &((char*)ptr)[i*smallPageSize];
        }

        std::cout << "after moving" << std::endl;
        if (0 != move_pages(0, nSmallPages, smallPages, nullptr, smallStatus, 0)) {
                std::cout << "failed to query pages because " << strerror(errno) << std::endl;
        }
        else {
                for (uint32_t i = 0; i < nSmallPages; i++) {
                        std::cout << "page # " << i << " is on numa node " << smallStatus[i] << std::endl;
                }
        }
}

And should I query the NUMA IDs at 4KB page granularity (like the code above), or at 2MB?

Unrepair answered 14/1, 2020 at 1:8 Comment(4)
The question is a bit specific to the Linux kernel version, the architecture (x86_64), and the method of hugepage allocation (hugetlbfs may have a different answer; kernel.org/doc/Documentation/vm/hugetlbpage.txt and libnuma should be checked). Please include the relevant info in your question. For question 3 you can use small-page addresses in the "pages" array to query the NUMA status of the memory; it will work with THP (kernel.org/doc/Documentation/vm/transhuge.txt). I expect THP will also work for moving, but I can't rule out that it decays into small pages during the process.Wilen
"I don't see how it indirectly calls migrate_huge_page()." - yes, as 3.10 version move_pages does not call migrate_huge_page; it was only for soft offline, as mentioned in linked patch lwn.net/Articles/544044 "Hugepage migration is now available only for soft offlining". But there are another huge-related functions in the mm/migrate.c file like migrate_huge_page_move_mapping, migrate_page_copy, copy_huge_page, unmap_and_move_huge_page.Wilen
@Wilen updated the question to specify kernel version, arch and how hugepages are allocated.Unrepair
@Wilen So can I move hugepages via move_pages()? Or do I have to use those hugepage-specific functions to move them?Unrepair

For the original version of the 3.10 Linux kernel (not the Red Hat patched one, as I have no LXR for RHEL kernels), the move_pages syscall will force huge pages (2MB; both THP and hugetlbfs styles) to be split into small pages (4KB). move_pages works in fairly short chunks (around 0.5MB if I calculated correctly) and the call graph looks like:

move_pages .. -> migrate_pages -> unmap_and_move ->

static int unmap_and_move(new_page_t get_new_page, unsigned long private,
            struct page *page, int force, enum migrate_mode mode)
{
    struct page *newpage = get_new_page(page, private, &result);
    ....
    if (unlikely(PageTransHuge(page)))
        if (unlikely(split_huge_page(page)))
            goto out;

PageTransHuge returns true for both kinds of hugepages (THP and hugetlbfs): https://elixir.bootlin.com/linux/v3.10/source/include/linux/page-flags.h#L411

PageTransHuge() returns true for both transparent huge and hugetlbfs pages, but not normal pages.

And split_huge_page will call split_huge_page_to_list which:

Split a hugepage into normal pages. This doesn't change the position of head page.

A split also increments the THP_SPLIT vm_event counter. The counters are exported in /proc/vmstat ("the file displays various virtual memory statistics"). You can check this counter with cat /proc/vmstat | grep thp_split (a UUOC, but it works) before and after your test.
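A quick way to sample that counter from a program is to parse /proc/vmstat. A minimal sketch (my addition, not part of the original discussion; thp_split is the 3.10-era counter name, newer kernels expose thp_split_page instead):

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

// Return the value of the first matching THP split counter in /proc/vmstat,
// or 0 if no such counter is present.
static uint64_t thp_split_count() {
        std::ifstream vmstat("/proc/vmstat");
        std::string key;
        uint64_t value = 0;
        while (vmstat >> key >> value) {
                if (key == "thp_split" || key == "thp_split_page")
                        return value;
        }
        return 0;
}

int main() {
        // sample before and after the move_pages() test and compare the two values
        std::cout << "thp_split = " << thp_split_count() << std::endl;
}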

There was some code for hugepage migration in 3.10, namely the unmap_and_move_huge_page function, but it is not called from move_pages. Its only user in 3.10 was migrate_huge_page, which is called only from the memory-failure handler soft_offline_huge_page (__soft_offline_page) (added in 2010):

Soft offline a page, by migration or invalidation, without killing anything. This is for the case when a page is not corrupted yet (so it's still valid to access), but has had a number of corrected errors and is better taken out.

Answers:

Can move_pages() move hugepages? If yes, should the page boundary be 4KB or 2MB when passing the array of page addresses? It seems there was a patch for supporting hugepage migration 5 years ago.

The standard 3.10 kernel has a move_pages that accepts a "pages" array of 4KB page pointers; it will break (split) each huge page into 512 small pages and then migrate those small pages. There is very little chance of them being merged back by THP, because move_pages makes separate requests for physical memory pages and they will almost always be non-contiguous.

Don't pass pointers with a 2MB stride: every huge page mentioned will still be split, but only the first 4KB small page of that memory will be migrated.
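To make that concrete, here is a small helper (my sketch, not code from the kernel or from the original answer) that issues the move request at 4KB granularity, which is what an unpatched 3.10 kernel effectively operates on:

#include <cstdint>
#include <cstring>
#include <cerrno>
#include <iostream>
#include <vector>
#include <numaif.h>

// Move every 4KB small page of [ptr, ptr + size) to dst_node.
// On an unpatched 3.10 kernel any huge pages backing the range get split first.
static int move_region_as_small_pages(void* ptr, uint64_t size, int dst_node) {
        constexpr uint64_t smallPageSize = 4lu * 1024;
        const uint64_t n = size / smallPageSize;
        std::vector<void*> pages(n);
        std::vector<int> nodes(n, dst_node);
        std::vector<int> status(n, -1);
        for (uint64_t i = 0; i < n; i++)
                pages[i] = static_cast<char*>(ptr) + i * smallPageSize;
        // MPOL_MF_MOVE is enough for pages mapped only by this process;
        // MPOL_MF_MOVE_ALL would additionally require CAP_SYS_NICE.
        if (move_pages(0, n, pages.data(), nodes.data(), status.data(), MPOL_MF_MOVE) != 0) {
                std::cout << "move_pages failed: " << strerror(errno) << std::endl;
                return -1;
        }
        return 0;
}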

The 2013 patch was not added to the original 3.10 kernel.

The patch seems to have been accepted in September 2013: https://github.com/torvalds/linux/search?q=+extend+hugepage+migration&type=Commits

If move_pages() cannot move hugepages, how can I move them?

move_pages will move the data of hugepages as small pages. You can either allocate a huge page manually on the correct NUMA node and copy your data into it (copy twice if you want to keep the virtual address; a sketch of this is shown below), or update the kernel to a version with the patch and use the methods and tests of the patch author, Naoya Horiguchi (JP). There is a copy of his tests: https://github.com/srikanth007m/test_hugepage_migration_extension (https://github.com/Naoya-Horiguchi/test_core is required)

https://github.com/srikanth007m/test_hugepage_migration_extension/blob/master/test_move_pages.c

Right now I'm not sure how to start the test or how to check that it works correctly. For ./test_move_pages -v -m private -h 2048 runs on a recent kernel it does not increment the THP_SPLIT counter.

His test looks very similar to our tests: mmap, memset to fault the pages in, filling the pages array with pointers to small pages, then numa_move_pages.
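For the first alternative (allocating a huge page directly on the target node and copying), a minimal sketch could look like this. It is my illustration rather than code from the answer or the patch author; it assumes the destination node still has free pages in its preallocated hugetlbfs pool, and it binds the fresh MAP_HUGETLB mapping to that node with mbind() before the copy faults it in:

#include <cstdint>
#include <cstring>
#include <cerrno>
#include <iostream>
#include <numaif.h>
#include <sys/mman.h>

// Allocate 'size' bytes (a multiple of the hugepage size) backed by hugetlbfs
// pages faulted in on 'node', then copy 'src' into the new mapping.
static void* copy_to_hugepages_on_node(const void* src, uint64_t size, int node) {
        void* dst = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
        if (dst == MAP_FAILED)
                return nullptr;
        // Bind the not-yet-faulted mapping to the destination node so its huge
        // pages are taken from that node's pool on first touch.
        unsigned long nodemask = 1ul << node;            // assumes node < 64
        if (mbind(dst, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0) {
                std::cout << "mbind failed: " << strerror(errno) << std::endl;
                munmap(dst, size);
                return nullptr;
        }
        memcpy(dst, src, size);                          // faults the huge pages in on 'node'
        return dst;
}

The caller ends up with a new virtual address; to keep the old one you would have to copy back again after remapping, which is the "copy twice" case mentioned above.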

After moving hugepages, can I query the NUMA IDs of the hugepages the same way I query regular pages, as in this answer?

You can query the status of any memory by providing a correct "pages" array to the move_pages syscall in query mode (with nodes == NULL). The array should list every small page of the memory region you want to check.

If you know a reliable method to check whether the memory is mapped to a huge page or not, you can query just one small page of each huge page. I think there could be a probabilistic method if you can export the physical address from the kernel to user space (using some LKM, for example): for a huge page the virtual and physical addresses always share the 21 least-significant bits, while for a small page the bits above the 4KB page offset coincide only by chance (roughly one time in 512). Or just write an LKM to export the PMD directory.
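One way to get at the physical address without writing an LKM (my suggestion, not part of the original answer) is /proc/self/pagemap, which exposes the page frame number of every virtual 4KB page; reading PFNs needs root/CAP_SYS_ADMIN on newer kernels. For a PMD-mapped 2MB page, bits 12..20 of the virtual address must equal the low 9 bits of the PFN:

#include <cstdint>
#include <fcntl.h>
#include <unistd.h>

// Probabilistic hugepage check via /proc/self/pagemap.
// Returns 1 if vaddr looks 2MB-mapped, 0 if not, -1 on error or page not present.
static int looks_like_hugepage(uintptr_t vaddr) {
        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0)
                return -1;
        uint64_t entry = 0;
        off_t offset = (vaddr / 4096) * sizeof(entry);   // one 64-bit entry per 4KB page
        ssize_t n = pread(fd, &entry, sizeof(entry), offset);
        close(fd);
        if (n != (ssize_t)sizeof(entry))
                return -1;
        if (!(entry & (1ull << 63)))                     // bit 63: page present
                return -1;
        uint64_t pfn = entry & ((1ull << 55) - 1);       // bits 0..54: page frame number
        // A 2MB mapping shares its low 21 bits between virtual and physical address,
        // i.e. bits 12..20 of vaddr equal the low 9 bits of the PFN.
        return ((vaddr >> 12) & 0x1ff) == (pfn & 0x1ff);
}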

Wilen answered 14/1, 2020 at 8:53 Comment(13)
Just checked the version linux-3.10.0-1062.4.3.el7 - mm/migrate.c has the same code in unmap_and_move with PageTransHuge(page) & split_huge_page(page) as the original 3.10 kernel.Wilen
Thanks for your detailed explanation. You mentioned that passing 2MB-stride pointers will still split every huge page but migrate only the first 4KB small page of that memory. But the strange thing is that it seems to migrate all 512 * 4KB small pages in my example, and my example gives an array of pointers to 2MB hugepages.Unrepair
You also mentioned that there is very little chance of the small pages being merged back by THP since move_pages makes separate requests for physical pages that will almost always be non-contiguous. So move_pages() does attempt to merge the 512 small pages back into a single 2MB hugepage?Unrepair
I am using kernel 3.10.0-1062.4.3.el7.x86_64 from CentOS 7's official repository.Unrepair
I just downloaded the source, and I see the patches you mentioned. So that explains why, even though my example uses an array of pointers to 2MB hugepages, all hugepages are moved correctly and the thp_split counter doesn't increase.Unrepair
How can I "allocate huge page in manual mode at correct numa node"? That would save me from calling move_pages() in some cases. Thanks.Unrepair
@Unrepair: to allocate huge pages on a node you can try numa_alloc_onnode from libnuma (man7.org/linux/man-pages/man3/numa.3.html) with an aligned size and a length of 2MB to allocate THP huge pages (deallocate only with numa_free()); or you can mmap a huge page in the usual way, call numa_tonode_memory to change the memory region's policy, and then fault it in. Not sure which will work. I did check the 3.10.0-1062 version and I was not able to find the patches lore.kernel.org/patchwork/cover/395020Wilen
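A minimal sketch of the first suggestion in that comment (my illustration; whether THP actually backs the region depends on THP being enabled and on the allocation being 2MB-aligned, which libnuma does not guarantee, and as noted below the memory will not come from the hugetlbfs pool; link with -lnuma):

#include <cstring>
#include <numa.h>

int main() {
        const size_t len = 2ul * 1024 * 1024;   // one 2MB hugepage worth of memory
        const int node = 0;                     // example target NUMA node
        if (numa_available() < 0)
                return 1;
        // numa_alloc_onnode() mmaps without MAP_HUGETLB and mbinds the region to 'node'
        void* buf = numa_alloc_onnode(len, node);
        if (buf == nullptr)
                return 1;
        memset(buf, 0, len);                    // fault the memory in on 'node'
        numa_free(buf, len);                    // must be freed with numa_free()
        return 0;
}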
I believe that the patches were backported to CentOS 7's 3.10 kernel only but not the original version of 3.10.Unrepair
So if I have preallocated hugepages (e.g. allocated via boot parameters), will numa_alloc_onnode() grab from the preallocated hugepages, or will it allocate new ones if possible? The doc doesn't mention it... I guess I can test it out too. Wish the doc were more detailed.Unrepair
Which parameter did you use to preallocate hugepages? I think preallocation is needed for libhugetlbfs-style huge pages, not for THP-style pages. There are other problems with hugetlbfs huge pages (persistent huge pages) and NUMA - kernel.org/doc/Documentation/vm/hugetlbpage.txt: "Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt] persistent huge pages will be distributed across the node or nodes specified in the mempolicy as if "interleave" had been specified." The other kind is THP = transhuge, kernel.org/doc/Documentation/vm/transhuge.txtWilen
I am using the hugepages parameter in the boot parameters to preallocate hugepages at boot time, and I believe that is the (lib)hugetlbfs-style hugepages you referred to. And yes, the hugepages are preallocated across different NUMA nodes. So I guess numa_alloc_onnode() will just grab free hugepages from the local NUMA node first if possible, then from a foreign node if that fails, and only then actually allocate?Unrepair
numa_alloc_onnode will not allocate a hugetlbfs page, though it may allocate THP: github.com/jmesmon/numactl/blob/master/libnuma.c#L974 - the mmap in numa_alloc_* has no MAP_HUGETLB flag. To enforce NUMA placement it uses the mbind syscall.Wilen
Thanks for pointing to the source code. You are right. I guess I have to stick with move_pages() and hugetlbfs then. Really appreciate all your help (and all the links you offered - many new things to learn there).Unrepair
