How to know whether a copy-on-write page is an actual copy?

When I create a copy-on-write mapping (a MAP_PRIVATE mapping) using mmap, some pages of this mapping will be copied as soon as I write to specific addresses. At a certain point in my program I would like to figure out which pages have actually been copied. There is a call, mincore, but it only reports whether a page is in memory or not, which is not the same as whether the page has been copied.

Is there some way to figure out which pages have been copied?
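A minimal sketch of the problem, assuming Linux: it maps a scratch file (created with mkstemp under /tmp, an arbitrary choice) with MAP_PRIVATE, reads one page, writes another, and asks mincore about both. The helper name residency_after_read_and_write is made up for illustration.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* Hypothetical helper: maps `npages` pages (npages >= 2) of a scratch file
   MAP_PRIVATE, reads page 0, writes page 1 and stores mincore()'s residency
   bytes in vec (which must hold npages bytes). Returns 0 on success. */
static int residency_after_read_and_write(unsigned char *vec, size_t npages)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t len = (size_t)ps * npages;
    char path[] = "/tmp/cowtestXXXXXX";   /* scratch location is an assumption */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                         /* keep the file only while it is in use */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    char *m = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                            /* the mapping keeps its own reference */
    if (m == MAP_FAILED)
        return -1;

    volatile char r = m[0];               /* page 0: only read, still the file's page */
    (void)r;
    m[ps] = 'x';                          /* page 1: written, so the kernel copies it */

    int rc = mincore(m, len, vec);        /* bit 0 of each byte: page is resident */
    munmap(m, len);
    return rc;
}
```

Both bytes come back with bit 0 set, so mincore alone cannot tell the private copy apart from the page that is still shared with the page cache.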

Buttaro asked 18/12, 2010 at 10:27

Good: following MarkR's advice, I gave the pagemap and kpageflags interfaces a shot. Below is a quick test that checks whether a page is 'SWAPBACKED', as the kernel calls it. For a file on a normal disk filesystem, a page of a MAP_PRIVATE mapping that has been copied is backed by anonymous memory rather than by the file, which is what this flag indicates. One problem remains, of course: kpageflags is only accessible to root.

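#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char* argv[])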
{
  unsigned long long pagesize=getpagesize();
  assert(pagesize>0);
  int pagecount=4;
  int filesize=pagesize*pagecount;
  int fd=open("test.dat", O_RDWR);
  if (fd<0) // open() returns -1 on failure; 0 is a valid descriptor
    {
      fd=open("test.dat", O_CREAT|O_RDWR,S_IRUSR|S_IWUSR);
      printf("Created test.dat testfile\n");
    }
  assert(fd>=0);
  int err=ftruncate(fd,filesize);
  assert(!err);

  char* M=(char*)mmap(NULL, filesize, PROT_READ|PROT_WRITE, MAP_PRIVATE,fd,0);
  assert(M!=MAP_FAILED);
  printf("Successfully created private mapping\n");

The test setup contains 4 pages. Pages 0 and 2 are dirtied:

  strcpy(M,"I feel so dirty\n");
  strcpy(M+pagesize*2,"Christ on crutches\n");

Page 3 is only read from:

  volatile char t=M[pagesize*3]; // volatile keeps the compiler from optimizing the read away
  (void)t;

Page 1 is never accessed.

The pagemap file maps the process's virtual pages to physical page frames, whose flags can then be looked up in the global kpageflags file. See /usr/src/linux/Documentation/vm/pagemap.txt (Documentation/admin-guide/mm/pagemap.rst in newer kernel trees).

  int mapfd=open("/proc/self/pagemap",O_RDONLY);
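  assert(mapfd>=0);
  unsigned long long target=((unsigned long)(void*)M)/pagesize;
  // The byte offset can exceed INT_MAX on 64-bit systems, so keep it in an off_t
  off_t pos=lseek(mapfd, (off_t)(target*8), SEEK_SET);
  assert(pos==(off_t)(target*8));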
  assert(sizeof(long long)==8);

Here we read the pagemap entry, which holds the page frame number, for each of our virtual pages:

  unsigned long long page2pfn[pagecount];
  err=read(mapfd,page2pfn,sizeof(long long)*pagecount);
  if (err<0)
    perror("Reading pagemap");
  if(err!=pagecount*8)
    printf("Could only read %d bytes\n",err);

Now, for each virtual page that is present, we read the flags of the physical page behind it:

  int pageflags=open("/proc/kpageflags",O_RDONLY); // readable by root only
  assert(pageflags>=0);
  for(int i = 0 ; i < pagecount; i++)
    {
      unsigned long long v2a=page2pfn[i];
      printf("Page: %d, flag %llx\n",i,page2pfn[i]);

      if(v2a&(1ULL<<63)) // Is the virtual page present?
        {
        unsigned long long pfn=v2a&((1ULL<<55)-1); // bits 0-54 hold the PFN
        off_t fpos=lseek(pageflags,(off_t)(pfn*8),SEEK_SET);
        assert(fpos==(off_t)(pfn*8));
        unsigned long long pf;
        err=read(pageflags,&pf,8);
        assert(err==8);
        printf("pageflags are %llx with SWAPBACKED: %d\n",pf,(int)((pf>>14)&1)); // bit 14 = KPF_SWAPBACKED
        }
    }
}

All in all, I'm not particularly happy with this approach, since it requires access to a file that ordinary users can't read, and it is bloody complicated (how about a simple kernel call to retrieve the page flags?).
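The bit masks in the program above can be given names. The sketch below decodes one 64-bit /proc/pid/pagemap entry using the layout described in the kernel's pagemap documentation for current kernels; on 2.6.34, bits 55-60 held the page shift and bit 61 was reserved, so only the PFN and bits 62 and 63 mean the same thing there. The struct, function and macro names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Layout of one 64-bit pagemap entry (current kernel documentation). */
#define PM_PFN_MASK   ((UINT64_C(1) << 55) - 1)  /* bits 0-54: PFN if present */
#define PM_SOFT_DIRTY (UINT64_C(1) << 55)
#define PM_EXCLUSIVE  (UINT64_C(1) << 56)        /* exclusively mapped, since 4.2 */
#define PM_FILE       (UINT64_C(1) << 61)        /* file page or shared anon, since 3.5 */
#define PM_SWAPPED    (UINT64_C(1) << 62)
#define PM_PRESENT    (UINT64_C(1) << 63)

struct pagemap_entry {
    bool present, swapped, file_or_shared, exclusive, soft_dirty;
    uint64_t pfn;   /* 0 if not present (low bits then hold swap info) */
};

static struct pagemap_entry decode_pagemap(uint64_t e)
{
    struct pagemap_entry d;
    d.present        = (e & PM_PRESENT) != 0;
    d.swapped        = (e & PM_SWAPPED) != 0;
    d.file_or_shared = (e & PM_FILE) != 0;
    d.exclusive      = (e & PM_EXCLUSIVE) != 0;
    d.soft_dirty     = (e & PM_SOFT_DIRTY) != 0;
    d.pfn            = d.present ? (e & PM_PFN_MASK) : 0;
    return d;
}
```

Since Linux 4.0 the PFN is only reported to processes with CAP_SYS_ADMIN (from 4.2 on, others read it as zero), which is one more reason the kpageflags lookup needs root. On the kpageflags side, bit 14 is KPF_SWAPBACKED and bit 12 is KPF_ANON; a page copied out of a file mapping is anonymous, so it should carry KPF_ANON as well.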

Buttaro answered 20/12, 2010 at 5:16

I usually use mprotect to set my tracked copy-on-write pages to read-only, then handle the resulting SIGSEGVs by marking the given page dirty and enabling writing.

It isn't ideal, but the overhead is quite manageable, and it can be combined with mincore, etc. to do more complicated optimizations, like managing your working set size or approximating pointer information for pages you expect to be swapped out, which lets the runtime system cooperate with the kernel rather than fight it.
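A minimal sketch of that technique, assuming Linux, where returning from a SIGSEGV handler retries the faulting access: the region is made read-only with mprotect, and a handler installed with sigaction and SA_SIGINFO uses si_addr to find the page, records it as dirty and makes it writable again. The globals, the 64-page limit and the names track_writes/track_dirty are assumptions for illustration. Calling mprotect from a signal handler is not on the POSIX async-signal-safe list, although it is common practice on Linux.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

/* Hypothetical tracker for a single region; a real runtime would keep a table. */
static char *track_base;
static size_t track_pages;
static long track_pagesize;
static volatile sig_atomic_t track_dirty[64];   /* assumes at most 64 tracked pages */

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)track_base;
    if (addr < base || addr >= base + track_pages * (uintptr_t)track_pagesize) {
        signal(sig, SIG_DFL);   /* not our region: fault again with the default action */
        return;
    }
    size_t idx = (addr - base) / (uintptr_t)track_pagesize;
    track_dirty[idx] = 1;
    /* Make the page writable so the retried store succeeds. */
    mprotect(track_base + idx * (size_t)track_pagesize, (size_t)track_pagesize,
             PROT_READ | PROT_WRITE);
}

/* Write-protects `pages` pages at `base` and starts recording first writes. */
static int track_writes(char *base, size_t pages)
{
    if (pages > 64)
        return -1;
    track_base = base;
    track_pages = pages;
    track_pagesize = sysconf(_SC_PAGESIZE);
    for (size_t i = 0; i < pages; i++)
        track_dirty[i] = 0;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return mprotect(base, pages * (size_t)track_pagesize, PROT_READ);
}
```

After mapping the region MAP_PRIVATE, a runtime would call track_writes(base, n) and later inspect track_dirty[i]; once the dirty pages have been handled, the region can be write-protected again to start a new round.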

Objectionable answered 5/5, 2011 at 2:44 Comment(2)
How do you handle and recover from SIGSEGV for this purpose? – Infantile
@MattJoiner you can register a signal handler for it just like for any other signal, and execution will resume normally afterwards if the permissions of the page being accessed have been changed to allow the access. This is because, when execution resumes, the offending load/store is retried. So call mprotect from your handler. – Irritable

It is not easy, but it is possible to determine this. To find out whether a page is a copy of another page (possibly another process's), you need to do the following (on recent-ish kernels):

  1. Read the entry in /proc/pid/pagemap for the appropriate pages in the process(es)
  2. Interrogate /proc/kpageflags

You can then determine whether two pages are actually the same page in memory.

It is fairly tricky to do this, you need to be root, and whatever you do will probably have some race conditions in it, but it is possible.
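The pagemap half of that recipe can be sketched for two addresses in the same process. This is a sketch under assumptions: Linux, and a PFN field in /proc/self/pagemap that is visible to the caller, which since Linux 4.0 requires CAP_SYS_ADMIN (4.0 and 4.1 refuse the open, later kernels report the PFN as zero). The helper names pfn_of and same_physical_page are hypothetical, and the answer is only a snapshot, since the kernel may copy, migrate or swap the page right after it is read.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Looks up the page frame number behind `addr` via /proc/self/pagemap.
   Returns 0 and stores the PFN on success, -1 if the page is not present,
   the file can't be read, or the kernel hides the PFN (reported as 0). */
static int pfn_of(const void *addr, uint64_t *pfn)
{
    long ps = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;
    uint64_t entry;
    off_t off = (off_t)((uintptr_t)addr / (uintptr_t)ps) * (off_t)sizeof entry;
    ssize_t n = pread(fd, &entry, sizeof entry, off);
    close(fd);
    if (n != (ssize_t)sizeof entry || !(entry & (UINT64_C(1) << 63)))
        return -1;                            /* read failed or page not present */
    *pfn = entry & ((UINT64_C(1) << 55) - 1); /* bits 0-54 */
    return *pfn != 0 ? 0 : -1;
}

/* Hypothetical helper: 1 if both addresses sit on the same physical page,
   0 if they don't, -1 if that can't be determined right now. */
static int same_physical_page(const void *a, const void *b)
{
    uint64_t pa, pb;
    if (pfn_of(a, &pa) != 0 || pfn_of(b, &pb) != 0)
        return -1;
    return pa == pb;
}
```

Mapping the same file MAP_PRIVATE twice and reading through both mappings should give the same frame; after a write through one of them, the frames differ.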

Crisper answered 19/12, 2010 at 16:08 Comment(3)
That will tell you if two "clean" copy-on-write pages are sharing the same memory. Finding out whether a page created as copy-on-write is "clean" or "dirty" is much easier. – Pup
I have kernel 2.6.34 running (probably not recent enough) and have neither kpageflags nor the pagemap file. Also, I'm writing a user-level program, not something that should be running as root. In answer to Ben Voigt, who claims that it is easier to find out whether a page is dirty or clean: well... how would I do that, then? Because that is exactly the question! – Buttaro
According to the pagemap.txt documentation, 2.6.34 is new enough. It might not be necessary to use the (root-only) kpageflags interface to do what you want, in which case you won't need root. – Crisper
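One way this can work without root, assuming Linux 3.5 or later: bit 61 of each pagemap entry says whether the page is a file page (or shared anonymous memory), and that flag is visible to unprivileged processes even when the PFN is hidden. For a MAP_PRIVATE file mapping, a present or swapped page without that bit is anonymous memory, i.e. the private copy made on the first write. The sketch below relies on that reading of the kernel's pagemap documentation; find_copied_pages is a hypothetical name, and on kernels older than 3.5 (including 2.6.34) bit 61 is reserved, so it would report every present page as copied.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PM_PRESENT (UINT64_C(1) << 63)
#define PM_SWAPPED (UINT64_C(1) << 62)
#define PM_FILE    (UINT64_C(1) << 61)   /* file page or shared anon (Linux >= 3.5) */

/* For a MAP_PRIVATE file mapping starting at `base`, sets copied[i] to 1 if
   page i is backed by a private (anonymous) copy and 0 otherwise.
   Returns 0 on success, -1 on error. Only flag bits are used, never the
   PFN, so no special privileges are needed. */
static int find_copied_pages(const void *base, size_t npages, unsigned char *copied)
{
    long ps = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;
    off_t off = (off_t)((uintptr_t)base / (uintptr_t)ps) * 8;
    for (size_t i = 0; i < npages; i++) {
        uint64_t e;
        if (pread(fd, &e, sizeof e, off + (off_t)(i * 8)) != (ssize_t)sizeof e) {
            close(fd);
            return -1;
        }
        /* Present or swapped, but not a file page: anonymous, i.e. copied. */
        copied[i] = (e & (PM_PRESENT | PM_SWAPPED)) && !(e & PM_FILE);
    }
    close(fd);
    return 0;
}
```

Pages that were never touched have an all-zero entry and are reported as not copied, which matches what the question asks for.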

Copy-on-write is implemented using the memory protection scheme of the virtual memory hardware.

When a read-only page is written to, a page fault occurs. The page fault handler checks whether the page carries the copy-on-write flag: if so, a new page is allocated, the contents of the old page are copied, and the write is retried.

The new page is neither read-only nor copy-on-write; the link to the original page is completely broken.

So all you need to do is test the memory protection flags for the page.

On Windows, the API is GetWorkingSet; see the explanation at VirtualQueryEx. I don't know what the corresponding Linux API is.

Pup answered 19/12, 2010 at 16:35 Comment(7)
I'm not using Windows, as stated. I also never said the page would be marked 'copy-on-write'; I just want to figure out which pages are dirty. As for there being no link to the original, I'm not sure, because in Linux it is possible to invalidate the copied pages by synchronizing the read-only pages (see msync). – Buttaro
@Werner: msync is irrelevant; it only causes the writes that would occur during munmap to occur earlier. But writes to copy-on-write pages do not affect the file (see the mmap man page, note concerning MAP_PRIVATE: "Stores to the region do not affect the original file."). And I know you said you are using Linux, but I was hoping that by providing the name of the Windows API, someone would step in and provide a link to the equivalent Linux function. – Pup
According to this question, there's no function call for reading the page protection flags, but they are accessible via /proc/self/maps. If a page mapped with MAP_PRIVATE has the write permission flag, then it's already been written to (dirty) and the page fault handler has already copied the data to an independent private page. – Pup
That is not correct. I checked the protection flags (in the maps file only) for my little example above, and it was rw for the entire file. One needs to go through the kpageflags file to figure it out. – Buttaro
@Ben: In the manpage of msync: 'MS_INVALIDATE asks to invalidate other mappings of the same file', which for a MAP_PRIVATE map. Taken from pubs.opengroup.org/onlinepubs/009695399/functions/msync.html: "When MS_INVALIDATE is specified, msync() shall invalidate all cached copies of mapped data that are inconsistent with the permanent storage locations such that subsequent references shall obtain data that was consistent with the permanent storage locations sometime between the call to msync() and the first subsequent memory reference to the data." – Buttaro
@Werner: You missed this earlier sentence: "It is unspecified whether data in MAP_PRIVATE mappings has any permanent storage locations." – Pup
It seems like the Wine project wasn't able to implement GetWorkingSet, which means it's possible that there is no way for unprivileged code to get this information. – Pup

I gave an answer to someone with a similar goal and referenced a question similar to yours.

I think bmargulies' answer to that question fits what you need perfectly when the two ideas are combined.

Stemma answered 27/1, 2011 at 0:37

I don't recall such an API being exported. Why do you want to do such a thing? (What is the root of the problem you're solving?)

You might want to take a look at /proc/[pid]/smaps, which provides somewhat detailed statistics of pages used/copied/stored for each mapping.
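smaps only gives per-mapping totals, but those can still say how many pages of a mapping have been copied. On kernels whose smaps has an Anonymous: line, that value for a MAP_PRIVATE file mapping is the amount of anonymous memory in it, i.e. the copied pages. Private_Dirty can overcount here, since a dirty page-cache page mapped by only one process also counts as private and dirty. The helper below is a hypothetical sketch that looks up one field for the mapping containing an address; it reports kB, not which pages.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns the value (in kB) of `field`, e.g. "Anonymous",
   for the mapping in /proc/self/smaps that contains `addr`, or -1. */
static long smaps_field_kb(const void *addr, const char *field)
{
    FILE *f = fopen("/proc/self/smaps", "r");
    if (!f)
        return -1;
    char line[8192];                      /* room for long path names */
    size_t flen = strlen(field);
    uintptr_t a = (uintptr_t)addr;
    int in_vma = 0;
    long result = -1;
    while (fgets(line, sizeof line, f)) {
        unsigned long start, end;
        /* Mapping header lines start with "start-end", e.g. "7f3c...-7f3c... rw-p ...". */
        if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
            if (in_vma)
                break;                    /* left our mapping without finding the field */
            in_vma = a >= start && a < end;
            continue;
        }
        if (in_vma && strncmp(line, field, flen) == 0 && line[flen] == ':') {
            if (sscanf(line + flen + 1, "%ld", &result) != 1)
                result = -1;
            break;
        }
    }
    fclose(f);
    return result;
}
```

For example, after writing to two pages of a four-page MAP_PRIVATE file mapping, smaps_field_kb(base, "Anonymous") should report two pages' worth of kB.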

Again, why would you want to do that? If you're sure this approach is the only one (usually, virtual memory is used and forgotten about), you might want to consider writing a kernel module that provides such functionality.

Cumshaw answered 18/12, 2010 at 11:10 Comment(2)
I'm currently implementing a software transactional memory, which requires consistent read states (MAP_SHARED doesn't offer that) and an atomic write. msync might do the trick, but it doesn't guarantee that a written page is not flushed to the file before msync is called. A solution I came up with was to use a MAP_PRIVATE mmap and then flush the dirty pages to disk myself, after which I would invalidate my own copy. – Buttaro
Hello, the smaps file just counts the number of pages, e.g.:
  Size: 112 kB
  Rss: 96 kB
  Pss: 1 kB
  Shared_Clean: 96 kB
  Shared_Dirty: 0 kB
  Private_Clean: 0 kB
  Private_Dirty: 0 kB
  Referenced: 96 kB
  Swap: 0 kB
  KernelPageSize: 4 kB
  MMUPageSize: 4 kB
– Buttaro
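The commit step described in the STM comment above (write the dirty private pages to the file, then drop the private copy) can be sketched with pwrite and madvise. This assumes Linux behaviour: MADV_DONTNEED on a MAP_PRIVATE file mapping discards the private copy, and the next access faults the page back in from the file. commit_page is a hypothetical helper; it gives no atomicity or durability by itself, so a real STM would still need fsync/fdatasync and some form of logging around it.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical commit step for one page of a MAP_PRIVATE file mapping:
   write the private copy back to the file, then discard it so that later
   accesses see the file (page cache) contents again. */
static int commit_page(int fd, char *base, size_t page)
{
    long ps = sysconf(_SC_PAGESIZE);
    char *p = base + page * (size_t)ps;
    off_t off = (off_t)(page * (size_t)ps);
    if (pwrite(fd, p, (size_t)ps, off) != (ssize_t)ps)
        return -1;
    /* On Linux this drops the private copy; the next access faults the
       page back in from the file. */
    return madvise(p, (size_t)ps, MADV_DONTNEED);
}
```

After the commit, the page is once again shared with the page cache, so later changes to the file become visible through the mapping until the next write makes a fresh private copy.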
