Walking page tables of a process in Linux

I'm trying to walk the page tables of a process in Linux. In a kernel module I wrote the following function:

static struct page *walk_page_table(unsigned long addr)
{
    pgd_t *pgd;
    pte_t *ptep, pte;
    pud_t *pud;
    pmd_t *pmd;

    struct page *page = NULL;
    struct mm_struct *mm = current->mm;

    pgd = pgd_offset(mm, addr);
    if (pgd_none(*pgd) || pgd_bad(*pgd))
        goto out;
    printk(KERN_NOTICE "Valid pgd");

    pud = pud_offset(pgd, addr);
    if (pud_none(*pud) || pud_bad(*pud))
        goto out;
    printk(KERN_NOTICE "Valid pud");

    pmd = pmd_offset(pud, addr);
    if (pmd_none(*pmd) || pmd_bad(*pmd))
        goto out;
    printk(KERN_NOTICE "Valid pmd");

    ptep = pte_offset_map(pmd, addr);
    if (!ptep)
        goto out;
    pte = *ptep;

    page = pte_page(pte);
    if (page)
        printk(KERN_INFO "page frame struct is @ %p", page);

 out:
    return page;
}

This function is called from the ioctl handler, and addr is a virtual address in the process's address space:

static int my_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long addr)
{
   struct page *page = walk_page_table(addr);
   ...
   return 0;
}

The strange thing is that when a user-space process calls ioctl, it segfaults. Yet the way I look up the page table entry seems to be correct, because dmesg shows, for example, the following for each ioctl call:

[ 1721.437104] Valid pgd
[ 1721.437108] Valid pud
[ 1721.437108] Valid pmd
[ 1721.437110] page frame struct is @ c17d9b80

So why can't the process complete the ioctl call correctly? Do I have to lock something before walking the page tables?

I'm working with kernel 2.6.35-22 and three-level page tables.

Thank you all!

Cloris answered 23/1, 2012 at 23:38 Comment(6)
Is it possible that the ioctl syscall returns successfully and the code segfaults after that? - Electrostriction
No, because the ioctl syscall is the last instruction in main before return. If I comment out the ioctl, the process doesn't segfault. - Cloris
Why did you hide the part where you use the address of the struct page? Are you sure your segfault does not come from there? Have you tried debugging this on qemu? - Mendicant
After the call to walk_page_table I only do a printk if page is NULL. I also tried keeping only the call to walk_page_table, but the process still segfaults. Maybe yes, the fastest way to find the problem is debugging. Thank you Quentin. - Cloris
Compile the code with debugging and force a stack trace during dumps so that you know exactly what is happening. Or use kgdb. Also, are you absolutely sure you're not using the new unlocked_ioctl feature of recent kernels? - Alameda
I have never used kgdb. I will debug a UML kernel with gdb. However, I'm not using unlocked_ioctl: kernel 2.6.35 still has the ioctl function pointer in struct file_operations. Thanks sessyargc.jp! - Cloris
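pte_unmap(ptep);

is missing just before the out label. When the kernel is built with CONFIG_HIGHPTE, pte_offset_map() maps the PTE page temporarily, and that mapping has to be released again. Try changing the code like this: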

    ...
    page = pte_page(pte);
    if (page)
        printk(KERN_INFO "page frame struct is @ %p", page);

    pte_unmap(ptep); 

out:
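
To see why the missing call matters, here is a minimal user-space sketch of the pairing rule. It is not kernel code: toy_preempt_count, toy_pte_offset_map() and toy_pte_unmap() are hypothetical stand-ins that only model the counter change reported in the comments for a CONFIG_HIGHPTE kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for preempt_count(); NOT the kernel's counter. */
static int toy_preempt_count;

/* Hypothetical model of pte_offset_map() with CONFIG_HIGHPTE:
 * mapping the PTE page bumps the counter. */
static int *toy_pte_offset_map(int *pte_page, size_t index)
{
    toy_preempt_count++;
    return &pte_page[index];
}

/* Hypothetical model of pte_unmap(): undoes the bump taken above. */
static void toy_pte_unmap(int *ptep)
{
    (void)ptep;
    toy_preempt_count--;
}

/* Walk that forgets to unmap, like the code in the question. */
static int walk_leaky(int *pte_page, size_t index)
{
    int *ptep = toy_pte_offset_map(pte_page, index);
    return *ptep;               /* returns with the mapping still held */
}

/* Walk that copies the entry, then unmaps before returning. */
static int walk_balanced(int *pte_page, size_t index)
{
    int *ptep = toy_pte_offset_map(pte_page, index);
    int pte = *ptep;            /* copy the entry while it is mapped */
    toy_pte_unmap(ptep);        /* every map is paired with one unmap */
    return pte;
}

/* Returns how much the counter changed across one walk. */
static int count_delta(int (*walk)(int *, size_t))
{
    int table[4] = { 10, 20, 30, 40 };
    int before = toy_preempt_count;
    int value = walk(table, 2);

    assert(value == 30);        /* both walks read the same entry */
    return toy_preempt_count - before;
}
```

In this model count_delta(walk_balanced) is 0, while count_delta(walk_leaky) is 1: the leaky walk returns to its caller with the counter still raised, which is the kind of imbalance the kernel complains about.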
Grimaldi answered 9/8, 2012 at 9:55 Comment(2)
Thank you. I was sure the kernel had been compiled without CONFIG_HIGHPTE, but that option was actually set, so pte_offset_map did a kmap. - Cloris
Thanks! I kept getting a crash with messages like "<fn> returned with preemption imbalance". I finally traced the preempt_count() increment to pte_offset_map()! Adding the pte_unmap decremented it and all was well. - Warrin

Look at /proc/<pid>/smaps in the proc filesystem, where you can see the user-space memory mappings:

cat smaps 
bfa60000-bfa81000 rw-p 00000000 00:00 0          [stack]
Size:                136 kB
Rss:                  44 kB
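
If you want to pick these numbers out programmatically, a small user-space helper along the following lines should do. This is only a sketch: smaps_field_kb() is a hypothetical name, and it assumes the "Field:   N kB" layout shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the value in kB of a field such as "Rss" from smaps-style
 * text, or -1 if the field is not found. Only the first match is used. */
static long smaps_field_kb(const char *text, const char *field)
{
    size_t len = strlen(field);
    const char *line = text;

    assert(len > 0);
    while (line && *line) {
        /* A field line starts with "<field>:" at the start of a line. */
        if (strncmp(line, field, len) == 0 && line[len] == ':') {
            long kb;

            if (sscanf(line + len + 1, "%ld kB", &kb) == 1)
                return kb;
        }
        line = strchr(line, '\n');
        if (line)
            line++;             /* step past the newline */
    }
    return -1;
}
```

Fed the three lines of output above, it returns 136 for "Size" and 44 for "Rss". To use it on a live process, read /proc/self/smaps (or /proc/<pid>/smaps) into a buffer first and pass the buffer as text.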

This output is produced by fs/proc/task_mmu.c (from the kernel source):

http://lxr.linux.no/linux+v3.0.4/fs/proc/task_mmu.c

   if (vma->vm_mm && !is_vm_hugetlb_page(vma))
               walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
               show_map_vma(m, vma.....);
        seq_printf(m,
                   "Size:           %8lu kB\n"
                   "Rss:            %8lu kB\n"
                   "Pss:            %8lu kB\n"

Your function is similar to walk_page_range(). Looking into walk_page_range(), you can see that the smaps_walk structure is not supposed to change while the walk is in progress:

http://lxr.linux.no/linux+v3.0.4/mm/pagewalk.c#L153

For example:

                }
                if (walk->pgd_entry)
                        err = walk->pgd_entry(pgd, addr, next, walk);
                if (!err &&
                    (walk->pud_entry || walk->pmd_entry || walk->pte_entry

If the memory contents were to change during the walk, all of the above checks could become inconsistent.

All this means that you have to hold mmap_sem while walking the page tables:

   if (!down_read_trylock(&mm->mmap_sem)) {
            /*
             * Activate page so shrink_inactive_list is unlikely to unmap
             * its ptes while lock is dropped, so swapoff can make progress.
             */
            activate_page(page);
            unlock_page(page);
            down_read(&mm->mmap_sem);
            lock_page(page);
    }

followed by unlocking once the walk is done:

up_read(&mm->mmap_sem);
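
The reader/writer idea behind this can be sketched in user space. The following is a toy model under stated assumptions, not the kernel's rw_semaphore: struct toy_rwsem and the toy_* functions are hypothetical, and only the try-lock paths are modelled (nothing ever sleeps).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock state: -1 = one writer, 0 = free, n > 0 = n readers.
 * Hypothetical stand-in for mm->mmap_sem, not kernel code. */
struct toy_rwsem {
    atomic_int state;
};

/* Take the lock for reading unless a writer holds it. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
    int old = atomic_load(&sem->state);

    while (old >= 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&sem->state, &old, old + 1))
            return true;
    }
    return false;
}

static void toy_up_read(struct toy_rwsem *sem)
{
    int prev = atomic_fetch_sub(&sem->state, 1);

    assert(prev > 0);           /* must have been held for reading */
    (void)prev;
}

/* Take the lock for writing only if nobody holds it at all. */
static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&sem->state, &expected, -1);
}

static void toy_up_write(struct toy_rwsem *sem)
{
    int prev = atomic_exchange(&sem->state, 0);

    assert(prev == -1);         /* must have been held for writing */
    (void)prev;
}

/* A walker reads the table only while holding the lock for reading. */
static bool toy_walk(struct toy_rwsem *sem, const int *table, int idx, int *out)
{
    if (!toy_down_read_trylock(sem))
        return false;           /* a writer is changing the mappings */
    *out = table[idx];
    toy_up_read(sem);
    return true;
}
```

Several walkers can hold the lock for reading at once, but a writer (standing in here for something like mmap or munmap changing the address space) gets in only when no reader is active, and a walk refuses to start while a writer holds the lock.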

And of course, when you printk() the page tables inside your kernel module, the module is running in the process context of your insmod process (just printk the "comm" field and you will see "insmod"). This means that mmap_sem is locked. It also means that the process is NOT running, so there is no console output until the process has completed (all printk() output goes to memory only).

Sounds logical?

Farflung answered 31/1, 2012 at 11:34 Comment(2)
Thank you Peter, I tried to take mmap_sem before the first instruction that reads the page tables, but it doesn't work: the same segmentation fault. However, when I call walk_page_table I'm not in the context of the insmod process: I call it inside my_ioctl, so I'm in the context of the process invoking the ioctl syscall. Could this make a difference? - Cloris
Yes, it makes a difference, because different processes have different per-process page tables if you walk the non-kernel part. But all processes share the same page tables when it comes to kernel addresses. - Farflung
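
To make the user/kernel distinction in the last comment concrete, here is a small user-space sketch of how a 32-bit virtual address splits into page-table indices. It rests on assumptions the thread does not confirm: 32-bit x86 with PAE (which is what "three-level page tables" usually means there) and the default 3G/1G split with kernel addresses starting at 0xC0000000. The TOY_* constants and function names are illustrative, not the kernel's macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: 32-bit x86 with PAE (2 + 9 + 9 + 12 bits). */
#define TOY_PAGE_SHIFT   12
#define TOY_PMD_SHIFT    21
#define TOY_PGDIR_SHIFT  30
#define TOY_PTRS_PER_PTE 512u
#define TOY_PTRS_PER_PMD 512u
#define TOY_PTRS_PER_PGD 4u

/* Assumed default 3G/1G split: kernel addresses start here. */
#define TOY_PAGE_OFFSET  UINT32_C(0xC0000000)

struct va_parts {
    unsigned pgd;       /* index into the page global directory */
    unsigned pmd;       /* index into the page middle directory */
    unsigned pte;       /* index into the page table */
    unsigned offset;    /* byte offset inside the 4 kB page */
};

static struct va_parts split_va(uint32_t addr)
{
    struct va_parts p;

    p.pgd    = (addr >> TOY_PGDIR_SHIFT) & (TOY_PTRS_PER_PGD - 1);
    p.pmd    = (addr >> TOY_PMD_SHIFT)   & (TOY_PTRS_PER_PMD - 1);
    p.pte    = (addr >> TOY_PAGE_SHIFT)  & (TOY_PTRS_PER_PTE - 1);
    p.offset = addr & ((1u << TOY_PAGE_SHIFT) - 1);

    /* The four fields must recombine into the original address. */
    assert((((uint32_t)p.pgd << TOY_PGDIR_SHIFT) |
            ((uint32_t)p.pmd << TOY_PMD_SHIFT) |
            ((uint32_t)p.pte << TOY_PAGE_SHIFT) | p.offset) == addr);
    return p;
}

/* True if the address lies in the range shared by all processes. */
static bool is_kernel_addr(uint32_t addr)
{
    return addr >= TOY_PAGE_OFFSET;
}
```

Under these assumptions the stack address 0xbfa60000 from the smaps output splits into pgd 2, pmd 509, pte 96, offset 0, and it is a user address, so its translation belongs to one process only. The struct page pointer 0xc17d9b80 printed in the question falls under pgd index 3, the kernel range, which is the same for every process.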
