Allocating copy on write memory within a process
R

2

43

I have a memory segment which was obtained via mmap with MAP_ANONYMOUS.

How can I allocate a second memory segment of the same size which references the first one, and make both copy-on-write, on Linux (I am working with Linux 2.6.36 at the moment)?

I want to have exactly the same effect as fork, just without creating a new process. I want the new mapping to stay in the same process.

The whole process has to be repeatable on both the origin and copy pages (as if parent and child would continue to fork).

The reason I don't want to allocate a straight copy of the whole segment is that the segments are multiple gigabytes large, and I don't want to spend memory on pages that could be shared copy-on-write.

What I have tried:

mmap the segment shared, anonymous. On duplication mprotect it to read-only and create a second mapping with remap_file_pages also read-only.

Then use libsigsegv to intercept write attempts, manually make a copy of the page and then mprotect both to read-write.

Does the trick, but is very dirty. I am essentially implementing my own VM.

Sadly, mmapping /proc/self/mem is not supported on current Linux; otherwise a MAP_PRIVATE mapping there could do the trick.

Copy-on-write mechanics are part of the Linux VM, there has to be a way to make use of them without creating a new process.

As a note: I have found the appropriate mechanics in the Mach VM.

The following code compiles on my OS X 10.7.5 and has the expected behaviour: Darwin 11.4.2 Darwin Kernel Version 11.4.2: Thu Aug 23 16:25:48 PDT 2012; root:xnu-1699.32.7~1/RELEASE_X86_64 x86_64 i386

gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)

#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#ifdef __MACH__
#include <mach/mach.h>
#endif


int main() {

    mach_port_t this_task = mach_task_self();

    struct {
        size_t rss;
        size_t vms;
        void * a1;
        void * a2;
        char p1;
        char p2;
        } results[3];

    size_t length = sysconf(_SC_PAGE_SIZE);
    vm_address_t first_address;
    kern_return_t result = vm_allocate(this_task, &first_address, length, VM_FLAGS_ANYWHERE);

    if ( result != ERR_SUCCESS ) {
        fprintf(stderr, "Error allocating initial 0x%zx bytes of memory.\n", length);
        return -1;
    }

    char * first_address_p = (char *)first_address;
    char * mirror_address_p;
    *first_address_p = 'a';

    struct task_basic_info t_info;
    mach_msg_type_number_t t_info_count = TASK_BASIC_INFO_COUNT;

    task_info(this_task, TASK_BASIC_INFO, (task_info_t)&t_info, &t_info_count);
    results[0].rss = t_info.resident_size;
    results[0].vms = t_info.virtual_size;
    results[0].a1 = first_address_p;
    results[0].p1 = *first_address_p;

    vm_address_t mirrorAddress;
    vm_prot_t cur_prot, max_prot;
    result = vm_remap(this_task,
                      &mirrorAddress,   // mirror target
                      length,    // size of mirror
                      0,                 // auto alignment
                      1,                 // remap anywhere
                      this_task,  // same task
                      first_address,     // mirror source
                      1,                 // Copy
                      &cur_prot,         // unused protection struct
                      &max_prot,         // unused protection struct
                      VM_INHERIT_COPY);

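    if ( result != ERR_SUCCESS ) {
        // vm_remap reports failure through its kern_return_t, not errno,
        // so perror() would print an unrelated message here.
        fprintf(stderr, "Error remapping pages (kern_return_t %d).\n", result);
        return -1;
    }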

    mirror_address_p = (char *)mirrorAddress;

    task_info(this_task, TASK_BASIC_INFO, (task_info_t)&t_info, &t_info_count);
    results[1].rss = t_info.resident_size;
    results[1].vms = t_info.virtual_size;
    results[1].a1 = first_address_p;
    results[1].p1 = *first_address_p;
    results[1].a2 = mirror_address_p;
    results[1].p2 = *mirror_address_p;

    *mirror_address_p = 'b';

    task_info(this_task, TASK_BASIC_INFO, (task_info_t)&t_info, &t_info_count);
    results[2].rss = t_info.resident_size;
    results[2].vms = t_info.virtual_size;
    results[2].a1 = first_address_p;
    results[2].p1 = *first_address_p;
    results[2].a2 = mirror_address_p;
    results[2].p2 = *mirror_address_p;

    printf("Allocated one page of memory and wrote to it.\n");
    printf("*%p = '%c'\nRSS: %zu\tVMS: %zu\n",results[0].a1, results[0].p1, results[0].rss, results[0].vms);
    printf("Cloned that page copy-on-write.\n");
    printf("*%p = '%c'\n*%p = '%c'\nRSS: %zu\tVMS: %zu\n",results[1].a1, results[1].p1,results[1].a2, results[1].p2, results[1].rss, results[1].vms);
    printf("Wrote to the new cloned page.\n");
    printf("*%p = '%c'\n*%p = '%c'\nRSS: %zu\tVMS: %zu\n",results[2].a1, results[2].p1,results[2].a2, results[2].p2, results[2].rss, results[2].vms);

    return 0;
}

I want the same effect in Linux.

Raymer answered 6/6, 2013 at 14:59 Comment(8)
You could use btrfs and use its file duplication with copy-on-write feature... however, you'd then have unnecessary copies of your data in the FS. Should work, but not exactly high-performance. - Inflation
Is patching the kernel out of the question? - Inflation
@Inflation Unfortunately it is :(. The code is intended to be deployable on machines I don't have root on. Deploying another file system isn't an option either for the same reason and performance. /dev/shm (tmpfs) is as far as I am willing to go with file-backed memory. - Raymer
How exactly is copy-on-write supposed to look to the client code, when the clients of both copies have to share the same address space? Are you going to have one (or both) actually move to new virtual addresses? - Nekton
@ChrisStratton The new copy mapping can be placed anywhere into my virtual address space and return a pointer. The origin mapping should stay where it is. Please check the vm_remap call in the mach code. This is exactly the semantics that I want - just in Linux. - Raymer
Possible duplicate: Can I do a copy on write memcpy in linux. - Surd
@artlessnoise The answers there are also not repeatable on the same mapping. Also I want to be able to do this ideally on page-scale. - Raymer
Also possibly related: Get the copy-on-write behaviour of fork()ing, without fork(). - Fishplate
A
9

I tried to achieve the same thing (in fact, my case is slightly simpler, as I only need to take snapshots of a live region; I do not need to take copies of the copies). I did not find a good solution for this.

Direct kernel support (or the lack thereof): By modifying the kernel or adding a module it should be possible to achieve this. However, there is no simple way to set up a new COW region from an existing one. The code used by fork (copy_page_range) copies a vm_area_struct from one process/virtual address space to another (new) one, but assumes that the address of the new mapping is the same as the address of the old one. If one wants to implement a "remap" feature, the function must be modified/duplicated in order to copy a vm_area_struct with address translation.

BTRFS: I thought of using COW on btrfs for this. I wrote a simple program that maps two reflinked files and compares their page information. However, looking at the page information with /proc/self/pagemap shows that the two instances of the file do not share the same cache pages (at least unless my test is wrong). So you will not gain much by doing this: the physical pages holding the same data will not be shared among different instances.

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>
#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>

void* map_file(const char* file) {
  struct stat file_stat;
  int fd = open(file, O_RDWR);
  assert(fd>=0);
  int temp = fstat(fd, &file_stat);
  assert(temp==0);
  void* res = mmap(NULL, file_stat.st_size, PROT_READ, MAP_SHARED, fd, 0);
  assert(res!=MAP_FAILED);
  close(fd);
  return res;
}

static int pagemap_fd = -1;

uint64_t pagemap_info(void* p) {
  if(pagemap_fd<0) {
    pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
    if(pagemap_fd<0) {
      perror("open pagemap");
      exit(1);
    }
  }
  size_t page = ((uintptr_t) p) / getpagesize();
  off_t pos = lseek(pagemap_fd, page*sizeof(uint64_t), SEEK_SET);
  if(pos==(off_t) -1) {
    perror("lseek");
    exit(1);
  }
  uint64_t value;
  ssize_t temp = read(pagemap_fd, (char*)&value, sizeof(uint64_t));
  if(temp<0) {
    perror("read");
    exit(1);
  }
  if(temp!=sizeof(uint64_t)) {
    exit(1);
  }
  return value;
}

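int main(int argc, char** argv) {
  if(argc<3) {
    fprintf(stderr, "usage: %s FILE REFLINKED_COPY\n", argv[0]);
    return 1;
  }

  char* a = (char*) map_file(argv[1]);
  char* b = (char*) map_file(argv[2]);

  // Touch each mapping so that its first page is faulted in
  // before the corresponding pagemap entry is read.
  volatile char x = a[0];
  uint64_t info1 = pagemap_info(a);

  volatile char y = b[0];
  uint64_t info2 = pagemap_info(b);
  (void) x; (void) y;

  fprintf(stderr, "%" PRIx64 " %" PRIx64 "\n", info1, info2);

  // Fails if the two files are not backed by the same page-cache page.
  assert(info1==info2);

  return 0;
}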
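A side note on the test above: it compares whole pagemap entries, flag bits included. Here is a minimal sketch of extracting only the page frame number, assuming the entry layout documented for /proc/<pid>/pagemap (bit 63 = present, bit 62 = swapped, bits 0-54 = PFN). The names pagemap_pfn and same_physical_page are made up for illustration. As far as I know, kernels from 4.2 onwards report the PFN as zero to callers without CAP_SYS_ADMIN, so on such systems the comparison is only meaningful when run as root.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of a 64-bit /proc/<pid>/pagemap entry. */
#define PM_PRESENT  (UINT64_C(1) << 63)        /* page is in RAM            */
#define PM_SWAPPED  (UINT64_C(1) << 62)        /* page is in swap           */
#define PM_PFN_MASK ((UINT64_C(1) << 55) - 1)  /* bits 0-54: frame number   */

/* Hypothetical helper: returns the PFN, or 0 if the page is not resident
 * (or if the kernel hid the PFN from an unprivileged caller). */
uint64_t pagemap_pfn(uint64_t entry) {
    /* The flag bits must lie outside the PFN field. */
    assert((PM_PFN_MASK & (PM_PRESENT | PM_SWAPPED)) == 0);
    if (!(entry & PM_PRESENT) || (entry & PM_SWAPPED))
        return 0;
    return entry & PM_PFN_MASK;
}

/* Hypothetical helper: true only if both entries name the same, known frame. */
int same_physical_page(uint64_t e1, uint64_t e2) {
    uint64_t p1 = pagemap_pfn(e1);
    uint64_t p2 = pagemap_pfn(e2);
    return p1 != 0 && p1 == p2;
}
```

With these, the final check in the test could become same_physical_page(info1, info2), which treats two entries with a hidden (zero) PFN as "unknown" rather than as equal.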

mprotect+mmap anonymous pages: It does not work in your case, but one solution is to use a MAP_SHARED file for the main memory region. On a snapshot, the file is mapped somewhere else and both instances are mprotect-ed read-only. On a write, an anonymous page is mapped in the snapshot, the data is copied into this new page and the original page is unprotected. However, this solution does not work in your case, as you will not be able to repeat the process on the snapshot (because it is not a plain MAP_SHARED area but a MAP_SHARED area with some MAP_ANONYMOUS pages). Moreover, it does not scale with the number of copies: if I have many COW copies, I have to repeat the same process for each copy, and the new page will not be shared among the copies. And I can't map the anonymous page in the original area instead, as it would then not be possible to map that anonymous page in the copies. In short, this solution does not work here.
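To make the single-snapshot case concrete, here is a rough sketch of that scheme, not a definitive implementation. It assumes one live region and one snapshot, uses a plain sigaction handler rather than libsigsegv, and backs the region with an unlinked temporary file. All names (g_live, g_snap, open_backing, cow_snapshot_demo) are invented for illustration. The handler calls mmap and mprotect, which are not on the POSIX async-signal-safe list, so treat it as Linux-specific.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical globals: one live region and one snapshot of it. */
static char *g_live, *g_snap;
static size_t g_len, g_page;

/* On a write fault in either mapping, give the snapshot a private anonymous
 * copy of the page, then make the live page writable again. */
static void on_segv(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t a = (uintptr_t)si->si_addr, off;
    if (a - (uintptr_t)g_live < g_len)      off = a - (uintptr_t)g_live;
    else if (a - (uintptr_t)g_snap < g_len) off = a - (uintptr_t)g_snap;
    else _exit(1);                           /* unrelated crash */
    off &= ~(uintptr_t)(g_page - 1);
    /* Replace the snapshot page with a fresh anonymous page... */
    if (mmap(g_snap + off, g_page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
        _exit(1);
    /* ...fill it from the still unmodified live page... */
    memcpy(g_snap + off, g_live + off, g_page);
    /* ...and let writes to the live page through again. */
    if (mprotect(g_live + off, g_page, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

/* Unlinked temporary file of the given size; tries /dev/shm, then /tmp. */
static int open_backing(size_t len) {
    char shm[] = "/dev/shm/cow_demo_XXXXXX", tmp[] = "/tmp/cow_demo_XXXXXX";
    char *path = shm;
    int fd = mkstemp(path);
    if (fd < 0) { path = tmp; fd = mkstemp(path); }
    if (fd < 0) return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return -1; }
    return fd;
}

/* Returns 0 if live region and snapshot diverge as expected. */
int cow_snapshot_demo(void) {
    g_page = (size_t)sysconf(_SC_PAGESIZE);
    assert((g_page & (g_page - 1)) == 0);    /* the handler masks with page-1 */
    g_len = 2 * g_page;
    int fd = open_backing(g_len);
    if (fd < 0) return -1;
    g_live = mmap(NULL, g_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (g_live == MAP_FAILED) return -1;
    g_live[0] = 'a';
    g_live[g_page] = 'x';

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) return -1;

    /* Take the snapshot: protect the live region, map the file a second time. */
    if (mprotect(g_live, g_len, PROT_READ) != 0) return -1;
    g_snap = mmap(NULL, g_len, PROT_READ, MAP_SHARED, fd, 0);
    if (g_snap == MAP_FAILED) return -1;
    close(fd);

    volatile char *live = g_live, *snap = g_snap;
    live[0] = 'b';        /* write fault on the live region */
    snap[g_page] = 'y';   /* write fault on the snapshot    */
    int ok = live[0] == 'b' && snap[0] == 'a' &&
             live[g_page] == 'x' && snap[g_page] == 'y';

    signal(SIGSEGV, SIG_DFL);
    munmap(g_live, g_len);
    munmap(g_snap, g_len);
    return ok ? 0 : 1;
}
```

After the faults, the snapshot is a mix of file-backed and anonymous pages, which is exactly why a second snapshot cannot be taken from it by simply mapping the file again.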

mprotect+remap_file_pages: This looks like the only way to do this without touching the Linux kernel. The downside is that, in general, you will probably have to make a remap_file_pages syscall for each page when making a copy: making that many syscalls might not be very efficient. When breaking the sharing of a page on a write, you need at least to: remap_file_pages a new/free page in place of the written-to page, and un-mprotect the new page. It is also necessary to reference count each page.

I do not think that the mprotect() based approaches would scale very well (if you handle a lot of memory like this). On Linux, mprotect() does not work at memory page granularity but at vm_area_struct granularity (the entries you find in /proc/<pid>/maps). Doing an mprotect() at memory page granularity will cause the kernel to constantly split and merge vm_area_struct entries:

  • you will end up with a very fragmented mm_struct (one holding a large number of vm_area_struct entries);

  • looking up a vm_area_struct (which is done for a lot of virtual memory related operations) is O(log #vm_area_struct), but it might still have a negative performance impact;

  • memory consumption for those structures.
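The splitting is easy to observe from userspace. A small sketch, assuming Linux and a readable /proc/self/maps: it counts mapping entries before and after write-protecting every other page of an anonymous region. The function names count_vmas and vma_growth_after_striped_mprotect are made up for this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Number of lines (one per vm_area_struct) in /proc/self/maps, or -1. */
long count_vmas(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) return -1;
    long n = 0;
    int c;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n') n++;
    fclose(f);
    return n;
}

/* Maps npages anonymous pages, makes every other page read-only and
 * returns how many additional mapping entries that produced, or -1. */
long vma_growth_after_striped_mprotect(size_t npages) {
    assert(npages > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, npages * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;

    long before = count_vmas();
    for (size_t i = 0; i < npages; i += 2) {
        /* Each call gives this page different permissions from its
         * neighbours, so the kernel has to split the surrounding area. */
        if (mprotect(p + i * page, page, PROT_READ) != 0) {
            munmap(p, npages * page);
            return -1;
        }
    }
    long after = count_vmas();

    munmap(p, npages * page);
    return (before < 0 || after < 0) ? -1 : after - before;
}
```

For a region of N pages striped this way, the growth should be close to N entries, which is the per-page overhead the list above is about.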

For this kind of reason, the remap_file_pages() syscall was created [http://lwn.net/Articles/24468/] in order to do non-linear memory mapping of a file. Doing this with mmap requires a lot of vm_area_struct entries. I do not even think that it was designed for page-granularity mapping: remap_file_pages() is not very optimised for this use case, as it would need a syscall per page.

I think the only viable solution is to let the kernel do it. It is possible to do it in userspace with remap_file_pages, but it will probably be quite inefficient, as a snapshot will in general need a number of syscalls proportional to the number of pages. A variant of remap_file_pages might do the trick.

This approach, however, duplicates the page logic of the kernel. I tend to think we should let the kernel do this. All in all, an implementation in the kernel seems to be the better solution. For someone who knows this part of the kernel, it should be quite easy to do.

KSM (Kernel Samepage Merging): There is one thing the kernel can do: it can try to deduplicate the pages. You will still have to copy the data, but the kernel should be able to merge the copies afterwards. You need to mmap a new anonymous area for your copy, copy the data manually with memcpy, and call madvise(start, length, MADV_MERGEABLE) on the areas. You need to enable KSM (as root):

echo 1 > /sys/kernel/mm/ksm/run
echo 10000 > /sys/kernel/mm/ksm/pages_to_scan

It works. It doesn't work so well with my workload, but that is probably because the pages are not shared a lot in the end. The downside is that you still have to do the copy (you cannot have an efficient COW) and then the kernel will un-merge the page when it is written to. It will generate page faults and cache misses when doing the copies, the KSM daemon thread will consume a lot of CPU (I have a CPU running at 100% for the whole simulation) and probably consume a lot of cache. So you will not gain time when doing the copy, but you might gain some memory. If your main motivation is to use less memory in the long run and you do not care that much about avoiding the copies, this solution might work for you.
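The userspace side of the KSM approach can be sketched roughly as follows. The function ksm_copy and its merge_ok out-parameter are invented for illustration. The source must be a page-aligned mapping, and madvise(MADV_MERGEABLE) fails with EINVAL on kernels built without KSM, so the sketch reports that instead of treating it as fatal.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: make a full copy of a page-aligned region and mark
 * both regions as candidates for KSM merging.
 * Returns the copy, or NULL if the allocation failed. If merge_ok is not
 * NULL, it is set to 1 when both madvise calls were accepted, else 0. */
void *ksm_copy(void *src, size_t len, int *merge_ok) {
    assert(src != NULL && len > 0);

    void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dst == MAP_FAILED)
        return NULL;

    /* The copy is a real one: every page is touched and duplicated here. */
    memcpy(dst, src, len);

    /* Only a hint: the ksmd thread merges identical pages later, and only
     * if /sys/kernel/mm/ksm/run has been set to 1. */
    int r1 = madvise(src, len, MADV_MERGEABLE);
    int r2 = madvise(dst, len, MADV_MERGEABLE);
    if (merge_ok)
        *merge_ok = (r1 == 0 && r2 == 0);
    return dst;
}
```

The memory saving shows up only after ksmd has scanned the pages, so this gives deduplication over time rather than the instant, copy-free sharing that fork provides.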

Argue answered 14/4, 2014 at 8:29 Comment(3)
You have a lot of nice ideas, sadly none fulfil sufficient requirements for my purpose. I have already discussed mprotect+mmap anonymous pages and mprotect+remap_file_pages in my question. I have not looked into BTRFS, so may check it out. KSM is sadly not an option because that relies on me creating copies in the first place and I want to avoid making those. I have even looked at patching the Linux kernel myself, but never found the time to do it. +1 for some good ideas. - Raymer
For reference, remap_file_pages is now deprecated and will probably be removed/replaced by a slow emulation. - Argue
What is the recommended amount of coffee for someone seriously considering messing with the kernel? I'm asking for a friend... - Hedger
I
2

Hmm... you could create a file in /dev/shm with MAP_SHARED, write to it, then reopen it twice with MAP_PRIVATE.
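A minimal sketch of this suggestion, assuming an unlinked temporary file stands in for the /dev/shm file (it tries /dev/shm first and falls back to /tmp). The names open_backing and private_copies_demo are made up for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Unlinked temporary file of the given size; tries /dev/shm, then /tmp. */
static int open_backing(size_t len) {
    char shm[] = "/dev/shm/cow_demo_XXXXXX", tmp[] = "/tmp/cow_demo_XXXXXX";
    char *path = shm;
    int fd = mkstemp(path);
    if (fd < 0) { path = tmp; fd = mkstemp(path); }
    if (fd < 0) return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return -1; }
    return fd;
}

/* Returns 0 if a write to one MAP_PRIVATE copy stays invisible to the
 * other copy and to the shared original. */
int private_copies_demo(void) {
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    assert(len > 0);
    int fd = open_backing(len);
    if (fd < 0) return -1;

    /* Fill the file through a shared mapping. */
    char *shared = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (shared == MAP_FAILED) return -1;
    shared[0] = 'a';

    /* Two copy-on-write views of the same file. */
    char *c1 = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    char *c2 = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);
    if (c1 == MAP_FAILED || c2 == MAP_FAILED) return -1;

    c1[0] = 'b';   /* the kernel copies this page for c1 only */

    int ok = c1[0] == 'b' && c2[0] == 'a' && shared[0] == 'a';
    munmap(shared, len);
    munmap(c1, len);
    munmap(c2, len);
    return ok ? 0 : 1;
}
```

The page modified through c1 now lives only in that mapping's private anonymous memory, so there is nothing in the file from which a further copy-on-write copy of c1 could be mapped.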

Inflation answered 6/6, 2013 at 15:9 Comment(4)
You mean reopen it with MAP_PRIVATE. Yes, this works. Once. I need to be able to repeat the process, duplicating and reduplicating pages. - Raymer
What would be the error code/message in that case? From my experience you can mmap a file as often as you want with MAP_PRIVATE. - Erosive
@DavidFoerster: He means that e.g. after creating a copy-on-write copy of A called B and making some changes in B, he wants to create a copy-on-write copy of B. That isn't possible with this method. - Inflation
@DavidFoerster you cannot write a dirty MAP_PRIVATE page back to a file and reopen it because no file descriptor is attached to it. - Raymer
