What happens when a mov instruction causes a page fault with interrupts disabled on x86?

Asked 26/9, 2012 at 17:35 Answered 27/9, 2012 at 20:53

Solved linux-kernel x86 linux-device-driver interrupt page-fault

I recently encountered an issue in a custom Linux kernel (2.6.31.5, x86) driver where copy_to_user would periodically not copy any bytes to user space. It would return the count of bytes passed to it, indicating that it had not copied anything. After code inspection we found that the code was disabling interrupts while calling copy_to_user, which violates its contract. After correcting this, the issue stopped occurring. Because the issue happened so infrequently, I need to prove that disabling the interrupts caused the issue.

If you look at the code snippet below from arch/x86/lib/usercopy_32.c, rep; movsl copies words to userspace, with the word count in CX. On exit, size is updated from CX, so CX will be 0 if the movsl executes correctly. Because CX was not zero, the movs? instructions must not have executed; that is the only way to fit both the definition of copy_to_user and the observed behavior.

/* Generic arbitrary sized copy.  */
#define __copy_user(to, from, size)                 \
do {                                    \
    int __d0, __d1, __d2;                       \
    __asm__ __volatile__(                       \
        "   cmp  $7,%0\n"                   \
        "   jbe  1f\n"                  \
        "   movl %1,%0\n"                   \
        "   negl %0\n"                  \
        "   andl $7,%0\n"                   \
        "   subl %0,%3\n"                   \
        "4: rep; movsb\n"                   \
        "   movl %3,%0\n"                   \
        "   shrl $2,%0\n"                   \
        "   andl $3,%3\n"                   \
        "   .align 2,0x90\n"                \
        "0: rep; movsl\n"                   \
        "   movl %3,%0\n"                   \
        "1: rep; movsb\n"                   \
        "2:\n"                          \
        ".section .fixup,\"ax\"\n"              \
        "5: addl %3,%0\n"                   \
        "   jmp 2b\n"                   \
        "3: lea 0(%3,%0,4),%0\n"                \
        "   jmp 2b\n"                   \
        ".previous\n"                       \
        ".section __ex_table,\"a\"\n"               \
        "   .align 4\n"                 \
        "   .long 4b,5b\n"                  \
        "   .long 0b,3b\n"                  \
        "   .long 1b,2b\n"                  \
        ".previous"                     \
        : "=&c"(size), "=&D" (__d0), "=&S" (__d1), "=r"(__d2)   \
        : "3"(size), "0"(size), "1"(to), "2"(from)      \
        : "memory");                        \
} while (0)

The 2 ideas that I have are:

1. When interrupts are disabled, the page fault does not occur and rep; movs? is skipped without doing anything. The return value would then be CX, the number of bytes not copied to userspace, which matches both the definition and the observed behavior.
2. The page fault does occur, but Linux cannot process it because interrupts are disabled, so the page fault handler skips the instruction, although I don't know how the page fault handler would do this. Again, in this case CX would remain unmodified and the return value would be correct.

Can anyone point me to the sections in the Intel manuals that specify this behavior, or point me to any additional Linux source that could be helpful?

Atween answered 26/9, 2012 at 17:35 Comment(3)

you mention that "the code was disabling interrupts". Can you elaborate which interrupts and how?... – Flesher 27/9, 2012 at 11:52

@TheCodeArtist: write_lock_bh(); was held, which by my understanding disables software interrupts. – Atween 27/9, 2012 at 19:40

@TheCodeArtist: Thanks! your comment made me look into write_lock_bh() much more closely, showing me the way! – Atween 27/9, 2012 at 20:54

I've found the answer. My #2 suggestion was correct and the mechanism was right in front of my face. The page fault does happen, but fixup_exception provides an exception/continue mechanism. This section adds entries to the exception handler table:

    ".section __ex_table,\"a\"\n"               \
    "   .align 4\n"                 \
    "   .long 4b,5b\n"                  \
    "   .long 0b,3b\n"                  \
    "   .long 1b,2b\n"                  \
    ".previous"                     \

This says: if a fault occurs while the instruction pointer (IP) is at the first address of an entry, the fault handler sets the IP to the second address and execution continues from there.

So if the exception happens at "4:", jump to "5:". If the exception happens at "0:", jump to "3:", and if the exception happens at "1:", jump to "2:".
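The lookup described above can be sketched in plain user-space C. This is only an illustrative model: struct ex_entry and find_fixup are hypothetical names, the addresses in any table you pass are made up, and a linear scan is used for brevity (the kernel keeps its table sorted and searches it differently).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one __ex_table entry: the address of an
 * instruction that may fault, and the address to resume at instead. */
struct ex_entry {
    uintptr_t insn;
    uintptr_t fixup;
};

/* Return the fixup address registered for the faulting IP, or 0 if the
 * IP has no entry (in which case the fault cannot be recovered this way). */
static uintptr_t find_fixup(const struct ex_entry *tbl, size_t n, uintptr_t ip)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].insn == ip)
            return tbl[i].fixup;
    }
    return 0;
}
```

A fault handler built on this model would overwrite the saved IP with the returned address when it is nonzero, which is how the faulting rep; movs? gets "skipped".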

The missing piece is in do_page_fault() in arch/x86/mm/fault.c:

/*
 * If we're in an interrupt, have no user context or are running
 * in an atomic region then we must not take the fault:
 */
if (unlikely(in_atomic() || !mm)) {
    bad_area_nosemaphore(regs, error_code, address);
    return;
}

in_atomic returned true because we are in a write_lock_bh() lock! bad_area_nosemaphore eventually does the fixup.
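To make the decision concrete, here is a simplified user-space model of that check. It assumes that disabling bottom halves adds a softirq unit to a per-CPU counter and that in_atomic() amounts to "the counter is nonzero"; every model_ name is hypothetical, and the real kernel macros carry extra detail that is omitted here.

```c
#include <stdbool.h>

/* Assumed to mirror the kernel's softirq count unit (1 << 8). */
#define MODEL_SOFTIRQ_OFFSET 0x100u

/* Hypothetical stand-in for the per-CPU preempt counter. */
static unsigned model_preempt_count;

/* write_lock_bh() disables bottom halves before taking the lock;
 * only the counter side effect is modelled here. */
static void model_local_bh_disable(void) { model_preempt_count += MODEL_SOFTIRQ_OFFSET; }
static void model_local_bh_enable(void)  { model_preempt_count -= MODEL_SOFTIRQ_OFFSET; }

static bool model_in_atomic(void) { return model_preempt_count != 0; }

enum fault_action { HANDLE_FAULT, USE_FIXUP };

/* Mirrors the quoted check: in atomic context, or without a user mm,
 * the fault is not serviced and the exception-table fixup runs instead. */
static enum fault_action model_page_fault(bool have_mm)
{
    if (model_in_atomic() || !have_mm)
        return USE_FIXUP;
    return HANDLE_FAULT;
}
```

In this model the same fault is serviced normally outside the lock and diverted to the fixup inside it, which matches the observation that the copy only failed when a page actually had to be faulted in.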

If a page fault occurred (which was unlikely, because the destination pages were usually already resident in the process's working set), the copy would fail and jump out of the __copy_user macro through the fixup code, with the uncopied byte count equal to size, because the fault was taken in atomic context and could not be serviced.
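The arithmetic done by the fixup labels can be checked with a small user-space model of the macro from the question. This is a sketch under stated assumptions: model_copy_user is a hypothetical function, ok_bytes (the number of destination bytes writable before the first fault) is an invented parameter, and a faulting 4-byte store is assumed to write nothing.

```c
#include <stdint.h>

/* Returns the number of bytes NOT copied, following the register
 * arithmetic of __copy_user: ecx plays %0 and d2 plays %3. */
static unsigned model_copy_user(uintptr_t to, unsigned size, unsigned ok_bytes)
{
    unsigned ecx = size;
    unsigned d2 = size;
    unsigned done = 0;

    if (size > 7) {
        ecx = (unsigned)(-to) & 7;      /* bytes needed to 8-byte align dest */
        d2 -= ecx;
        while (ecx) {                   /* 4: rep; movsb */
            if (done >= ok_bytes)
                return ecx + d2;        /* 5: addl %3,%0 */
            done++;
            ecx--;
        }
        ecx = d2 >> 2;                  /* number of whole 4-byte words */
        d2 &= 3;                        /* trailing bytes */
        while (ecx) {                   /* 0: rep; movsl */
            if (done + 4 > ok_bytes)
                return d2 + 4 * ecx;    /* 3: lea 0(%3,%0,4),%0 */
            done += 4;
            ecx--;
        }
        ecx = d2;
    }
    while (ecx) {                       /* 1: rep; movsb */
        if (done >= ok_bytes)
            return ecx;                 /* resume at 2: with ecx as is */
        done++;
        ecx--;
    }
    return ecx;                         /* 0 when everything was copied */
}
```

With ok_bytes set to 0 the model returns size for every entry path, which is the symptom from the question: the first store faults, the fixup runs, and the whole request is reported as uncopied.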

Atween answered 27/9, 2012 at 20:53 Comment(0)

Page faults are not maskable interrupts. In fact, they are technically not interrupts at all but exceptions, although I agree the difference is mostly semantic.

The reason your copy_to_user failed when you called it in atomic context with interrupts disabled is that the code has an explicit check for this.

See http://lxr.free-electrons.com/source/arch/x86/lib/usercopy_32.c#L575

Scion answered 27/9, 2012 at 13:45 Comment(1)

Thanks for your answer. The call worked most of the time. It only failed very rarely. If it was because of the atomic context, I would expect it to fail always. That condition shouldn't be executed on a Pentium anyway. According to Linus, boot_cpu_data.wp_works_ok should be nonzero on everything newer than a 386. – Atween 27/9, 2012 at 18:38
