mmap: performance when using multithreading

I have a program which performs some operations on a lot of files (> 10 000). It spawns N worker threads and each thread mmaps some file, does some work and munmaps it.

The problem I am facing is that 1 process with N worker threads performs worse than 2 processes with N/2 worker threads each. I can see this in iotop: 1 process + N threads uses only around 75% of the disk bandwidth, whereas 2 processes + N/2 threads use the full bandwidth.

Some notes:

  • This happens only when I use mmap()/munmap(). Replacing them with fopen()/fread() works just fine. But since the mmap()/munmap() calls come from a 3rd-party library, I would like to keep them in their original form.
  • madvise() is called with MADV_SEQUENTIAL, but removing it or changing the advice argument doesn't seem to change anything (or only slows things down).
  • Thread affinity doesn't seem to matter. I have tried pinning each thread to a specific core, and also pinning threads to core pairs (Hyper-Threading siblings). No results so far.
  • The load reported by htop is roughly the same in both cases.

So my questions are:

  • Is there anything about mmap() I am not aware of when it is used in a multithreaded environment?
  • If so, why do 2 processes have better performance?

EDIT:

  • As pointed out in the comments, this runs on a dual-socket server (2 CPUs). I should probably set thread affinities so that all threads run on the same CPU, but I think I already tried that and it didn't help.
  • Here is a piece of code that reproduces the same issue as my production software.
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <filesystem>
#include <iostream>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef WORKERS
#define WORKERS 16
#endif

bool stop = false;
std::mutex queue_mutex;
std::condition_variable queue_cv;

std::pair<const std::uint8_t*, std::size_t> map_file(const std::string& file_path)
{
    int fd = open(file_path.data(), O_RDONLY);
    if (fd != -1)
    {
        auto dir_ent = std::filesystem::directory_entry{file_path.data()};
        if (dir_ent.is_regular_file())
        {
            auto size = dir_ent.file_size();
            auto data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
            close(fd);
            // mmap can fail (e.g. on a zero-length file); don't madvise
            // or return MAP_FAILED as if it were a valid pointer.
            if (data != MAP_FAILED)
            {
                madvise(data, size, MADV_SEQUENTIAL);
                return { reinterpret_cast<const std::uint8_t*>(data), size };
            }
            return { nullptr, 0 };
        }

        close(fd);
    }

    return { nullptr, 0 };
}

void unmap_file(const std::uint8_t* data, std::size_t size)
{
    munmap((void*)data, size);
}

int main(int argc, char* argv[])
{
    std::deque<std::string> queue;

    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < WORKERS; ++i)
    {
        threads.emplace_back(
            [&]() {
                std::string path;

                while (true)
                {
                    {
                        std::unique_lock<std::mutex> lock(queue_mutex);
                        while (!stop && queue.empty())
                            queue_cv.wait(lock);
                        if (stop && queue.empty())
                            return;
                        path = queue.front();
                        queue.pop_front();
                    }

                    auto [data, size] = map_file(path);
                    std::uint8_t b = 0;
                    for (auto itr = data; itr < data + size; ++itr)
                        b ^= *itr;
                    unmap_file(data, size);

                    std::cout << (int)b << std::endl;
                }
            }
        );
    }

    for (auto& p : std::filesystem::recursive_directory_iterator{argv[1]})
    {
        std::unique_lock<std::mutex> lock(queue_mutex);
        if (p.is_regular_file())
        {
            queue.push_back(p.path().native());
            queue_cv.notify_one();
        }
    }

    stop = true;
    queue_cv.notify_all();

    for (auto& t : threads)
        t.join();

    return 0;
}

Essary answered 3/5, 2019 at 7:51 Comment(5)
why do 2 processes have better performance - could be unintended interaction between threads or false sharing. Post the code. – Cognoscenti
I'll try to come up with some minimal example then. I just thought that I missed something about mmap() and threading, but if it's not that simple to pinpoint the cause, I'll come up with something. – Hungry
On a multi-socket machine it could also be that memory is allocated on a remote NUMA node instead of a local one. See the output of numactl --hardware. – Cognoscenti
It is indeed a server with 2 CPUs, if that changes anything. I'll post a minimal example as soon as I can. – Hungry
I've added code with which I am able to reproduce the issue. – Hungry

Is there anything about mmap() I am not aware of when used in multithreaded environment?

Yes. mmap() requires significant virtual memory manipulation - effectively single-threading your process in places. Per this post from one Linus Torvalds:

... playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.

Downsides to mmap:

  • quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.

  • page faulting is expensive. That's how the mapping gets populated, and it's quite slow.

Note that much of the above also has to be single-threaded across the entire machine, such as the actual mapping of physical memory.

So the virtual memory manipulation that mapping files requires is not only expensive, it really can't be done in parallel - there's only one chunk of actual physical memory that the kernel has to keep track of, and multiple threads can't parallelize changes to a single process's virtual address space.

You'd almost certainly get better performance by reusing a memory buffer for each file, where each buffer is created once and is large enough to hold any file read into it, then reading the file with low-level POSIX read() call(s). You might also want to experiment with page-aligned buffers and direct IO, calling open() with the O_DIRECT flag (Linux-specific) to bypass the page cache: you apparently never re-read any data, so any caching is a waste of memory and CPU cycles.
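
A sketch of the direct-IO variant, assuming a 4096-byte alignment (the real requirement depends on the device and filesystem; kAlign, alloc_aligned, and open_direct are illustrative names, not an existing API):

#include <cstddef>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

// O_DIRECT requires the buffer, the file offset, and the transfer size
// to be aligned, typically to the logical block size of the device.
// (glibc exposes O_DIRECT under _GNU_SOURCE, which g++ defines by default.)
constexpr std::size_t kAlign = 4096;

// Allocate a buffer usable for O_DIRECT reads; returns nullptr on failure.
void* alloc_aligned(std::size_t size)
{
    void* p = nullptr;
    return posix_memalign(&p, kAlign, size) == 0 ? p : nullptr;
}

// Open a file for reading with the page cache bypassed entirely - the
// data is read once and never reused, so caching it buys nothing.
int open_direct(const char* path)
{
    return open(path, O_RDONLY | O_DIRECT);
}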

Reusing the buffer also completely eliminates any munmap() or delete/free().

You'd have to manage the buffers, though. Perhaps prepopulate a queue with N precreated buffers, and return each buffer to the queue when done with a file?
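
A minimal sketch of that buffer-queue idea, assuming fixed-size buffers at least as large as the largest file (BufferPool and read_file are illustrative names, not part of any library):

#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

#include <fcntl.h>
#include <unistd.h>

// A trivial pool of reusable buffers shared by the worker threads.
class BufferPool
{
public:
    BufferPool(std::size_t count, std::size_t size)
    {
        for (std::size_t i = 0; i < count; ++i)
            free_.emplace_back(size);
    }

    // Block until a buffer is available, then hand it out.
    std::vector<std::uint8_t> acquire()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !free_.empty(); });
        auto buf = std::move(free_.front());
        free_.pop_front();
        return buf;
    }

    // Return a buffer to the pool and wake one waiting worker.
    void release(std::vector<std::uint8_t> buf)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push_back(std::move(buf));
        }
        cv_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<std::vector<std::uint8_t>> free_;
};

// Read a whole file into the supplied buffer with plain read();
// returns the number of bytes read, or 0 on error.
std::size_t read_file(const char* path, std::vector<std::uint8_t>& buf)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return 0;

    std::size_t total = 0;
    while (total < buf.size())
    {
        ssize_t n = read(fd, buf.data() + total, buf.size() - total);
        if (n <= 0)              // 0 = EOF, -1 = error
            break;
        total += static_cast<std::size_t>(n);
    }
    close(fd);
    return total;
}

Each worker would then acquire() a buffer, read_file() into it, process the bytes, and release() the buffer - no mmap(), munmap(), or per-file allocation left in the hot path.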

As far as

If so, why do 2 processes have better performance?

The use of two processes splits the process-specific virtual memory manipulations caused by mmap() calls into two separable sets that can run in parallel.

Helyn answered 3/5, 2019 at 13:45 Comment(4)
There have been optimizations to mmap since that post from 2000; however, manipulating vm_area_struct, page mappings, and the TLB is still quite expensive and single-threaded, as you said. – Cognoscenti
@MaximEgorushkin Agreed, but that's the best authoritative source I'm aware of that clearly and concisely summarizes the very real costs of mmap(). Using mmap() on a file that is only read once is probably about the worst possible use case for mmap(), at least performance-wise. I personally view mmap() as mostly a function that can greatly simplify code at a small but real performance cost. – Helyn
Agreed. IMO, the best case for mmap is mapping a large file with the MAP_POPULATE or MAP_LOCKED flag to minimize page faults, and probably MAP_HUGETLB. +1. – Cognoscenti
Thank you, that makes sense. I know that mmap() is probably not the best thing to use here, but I am using a 3rd-party library which does the mmap(), so the minimal example above is just a simplified version of the library. I'll probably end up modifying the library myself, though, because everything is fine when I read the file instead of mapping it. – Hungry

A few notes:

  1. Try running your application with perf stat -ddd <app> and look at the context-switches, cpu-migrations, and page-faults numbers.
  2. The threads probably contend for the vm_area_struct in the kernel process structure on mmap and on page faults. Try passing the MAP_POPULATE or MAP_LOCKED flag to mmap to minimize page faults; alternatively, try mmap with MAP_POPULATE or MAP_LOCKED in the main thread only (you may want to ensure that all threads run on the same NUMA node in this case). See the sketch after this list.
  3. You may also like to experiment with MAP_HUGETLB and one of the MAP_HUGE_2MB, MAP_HUGE_1GB flags.
  4. Try binding threads to the same NUMA node with numactl to make sure that the threads only access local NUMA memory, e.g. numactl --membind=0 --cpunodebind=0 <app>.
  5. Lock the mutex before setting stop = true; otherwise the condition-variable notification can get lost and deadlock a waiting thread forever (also sketched after this list).
  6. The p.is_regular_file() check doesn't require the mutex to be locked.
  7. std::deque can be replaced with std::list, using splice() to push and pop elements, to minimize the time the mutex is held.
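
Minimal sketches of points 2 and 5, written against the shape of the code in the question (map_populated and request_stop are illustrative helper names):

#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <sys/mman.h>

// Point 2: map the file and ask the kernel to pre-fault every page up
// front (MAP_POPULATE), so the scan loop takes no per-page faults later.
// Falls back to a plain mapping if the populated one is rejected.
const std::uint8_t* map_populated(int fd, std::size_t size)
{
    void* data = mmap(nullptr, size, PROT_READ,
                      MAP_PRIVATE | MAP_POPULATE, fd, 0);
    if (data == MAP_FAILED)
        data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    return data == MAP_FAILED ? nullptr
                              : static_cast<const std::uint8_t*>(data);
}

// Point 5: publish `stop` under the same mutex the workers wait on, so a
// worker cannot test `stop`, miss the notification, and then block forever.
void request_stop(bool& stop, std::mutex& m, std::condition_variable& cv)
{
    {
        std::lock_guard<std::mutex> lock(m);
        stop = true;
    }
    cv.notify_all();
}
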
Cognoscenti answered 3/5, 2019 at 12:42 Comment(1)
Thank you. I have already started experimenting with the flags you mentioned. I have also found set_mempolicy, which is basically numactl. No results so far, but at least I am no longer stuck. I will get back to you as I try all the options. – Hungry
