Race condition when using dup2

This manpage for the dup2 system call says:

EBUSY (Linux only) This may be returned by dup2() or dup3() during a race condition with open(2) and dup().

What race condition does it talk about and what should I do if dup2 gives EBUSY error? Should I retry like in the case of EINTR?

Gastrulation answered 3/5, 2014 at 4:21 Comment(1)
So far I've found the following issue in OpenBSD, likely db.risk.io/cves/CVE-2001-1047. I'm not sure whether a similar situation is applicable to Linux, but if it's handled gracefully, then yes, you should treat it like EINTR.Transcontinental
C
13

There is an explanation in fs/file.c, do_dup2():

/*
 * We need to detect attempts to do dup2() over allocated but still
 * not finished descriptor.  NB: OpenBSD avoids that at the price of
 * extra work in their equivalent of fget() - they insert struct
 * file immediately after grabbing descriptor, mark it larval if
 * more work (e.g. actual opening) is needed and make sure that
 * fget() treats larval files as absent.  Potentially interesting,
 * but while extra work in fget() is trivial, locking implications
 * and amount of surgery on open()-related paths in VFS are not.
 * FreeBSD fails with -EBADF in the same situation, NetBSD "solution"
 * deadlocks in rather amusing ways, AFAICS.  All of that is out of
 * scope of POSIX or SUS, since neither considers shared descriptor
 * tables and this condition does not arise without those.
 */
fdt = files_fdtable(files);
tofree = fdt->fd[fd];
if (!tofree && fd_is_open(fd, fdt))
    goto Ebusy;

It looks like EBUSY is returned when the target descriptor is in an incomplete state: it has already been allocated (fd_is_open is true) but its file is still being opened, so there is no entry for it in the fdtable yet.

EDIT (more info and do want bounty)

In order to understand how !tofree && fd_is_open(fd, fdt) can happen, let's see how files are opened. Here is a simplified version of sys_open:

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
    /* ... irrelevant stuff */
    /* allocate the fd, uses a lock */
    fd = get_unused_fd_flags(flags);
    /* HERE the race condition can arise if another thread calls dup2 on fd */
    /* ... the real VFS open happens here and produces f (it may block) ... */
    /* install f in the fdt slot for this fd, also uses a lock */
    fd_install(fd, f);
    /* ... irrelevant stuff again */
    return fd;
}

Basically two very important things happen: a file descriptor is allocated, and only then is it actually opened by the VFS. These two operations modify the fdt of the process. They both use a lock, so nothing bad is to be expected inside those two calls.

To keep track of which fds have been allocated, the fdt uses a bit vector called open_fds. After get_unused_fd_flags(), the fd has been allocated and the corresponding bit is set in open_fds. The lock on the fdt has been released, but the real VFS job hasn't been done yet.

At this precise moment, another thread (or another process in the case of a shared fdt) can call dup2, which will not block because the locks have been released. If dup2 took its normal path here, the fd would be replaced, but fd_install would still run for the old file. Hence the check and the jump to Ebusy.
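To make the window concrete, here is a minimal single-threaded user-space model of the two-step open and the dup2 check. It is only a sketch: struct toy_fdtable and the toy_* functions are hypothetical names that mirror the kernel's in spirit, there is no locking, and "files" are just strings.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_FDS 16

/* Toy stand-in for the kernel's fdtable (hypothetical, not kernel code):
 * a bitmap of allocated descriptors plus an array of installed files. */
struct toy_fdtable {
    bool open_fds[TOY_MAX_FDS];   /* set when a descriptor is allocated */
    const char *fd[TOY_MAX_FDS];  /* filled in only when it is installed */
};

/* Step 1 of open(): reserve the lowest free descriptor. */
static int toy_get_unused_fd(struct toy_fdtable *t)
{
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (!t->open_fds[i]) {
            t->open_fds[i] = true;
            return i;
        }
    }
    return -EMFILE;
}

/* Step 2 of open(): publish the file once the slow part is done. */
static void toy_fd_install(struct toy_fdtable *t, int fd, const char *file)
{
    assert(t->open_fds[fd] && t->fd[fd] == NULL);
    t->fd[fd] = file;
}

/* dup2() onto newfd: refuse a slot that is reserved but still empty. */
static int toy_dup2(struct toy_fdtable *t, int oldfd, int newfd)
{
    if (oldfd < 0 || oldfd >= TOY_MAX_FDS ||
        newfd < 0 || newfd >= TOY_MAX_FDS)
        return -EBADF;
    if (t->fd[oldfd] == NULL)
        return -EBADF;
    if (t->fd[newfd] == NULL && t->open_fds[newfd])
        return -EBUSY;            /* allocated but not yet installed */
    t->open_fds[newfd] = true;
    t->fd[newfd] = t->fd[oldfd];
    return newfd;
}
```

Calling toy_dup2 between toy_get_unused_fd and toy_fd_install for the same descriptor returns -EBUSY in this model; calling it after the install, or onto a slot that was never allocated, succeeds.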

I found additional info on this race condition in the comments of fd_install() which confirms my explanation:

/* The VFS is full of places where we drop the files lock between
 * setting the open_fds bitmap and installing the file in the file
 * array.  At any such point, we are vulnerable to a dup2() race
 * installing a file in the array before us.  We need to detect this and
 * fput() the struct file we are about to overwrite in this case.
 *
 * It should never happen - if we allow dup2() do it, _really_ bad things
 * will follow. */
Caudex answered 3/5, 2014 at 12:33 Comment(13)
Could you provide more details on when this can happen? My impression is that it definitely can't happen in single-threaded processes (tasks not sharing their fd table via CLONE_FILES), but the comments in the kernel source ("All of that is out of scope of POSIX or SUS, since neither considers shared descriptor tables...") seem potentially wrong, since threads do share descriptor tables and are specified by POSIX.Gateshead
Further, my impression is that this error cannot occur when using dup2 correctly to atomically replace a descriptor the caller is aware of, only when using dup2 with a new fd that's unallocated and subject to race conditions if another thread calls open. My understanding is that such usage of dup2 invokes undefined behavior in a multithreaded process and thus the EBUSY error can be ignored (it doesn't happen without UB). Is this correct? I'm opening a bounty and will give it to whoever can clarify these things.Gateshead
@R.. Okay, but that sounds a bit more involved than this question… why not open a new question?Solan
@R. Unless I'm mistaken, kernel code (and comments) that can raise that condition (e.g. tofree = fdt->fd[fd]; might point to the dup'd and open fd, which causes goto Ebusy;) is the answer. As for what POSIX and SUS actually say, why does that matter? Also, did you see the NOTE in "man(2) dup2": "If newfd was open, any errors that would have been reported at close(2) time are lost. A careful programmer will not use dup2() or dup3() without closing newfd first."Conformal
@ElliottFrisch: The note in the man page is an error; the whole point of dup2 and dup3 is to replace a file descriptor atomically. Calling dup2 or dup3 when the destination fd is not already open is a serious bug with dire security and data-integrity consequences except in single-threaded processes. I reported this to Michael Kerrisk (man pages maintainer) just a few hours ago and I suspect he'll fix it.Gateshead
As for the cited kernel code being "the [complete] answer", unless you're highly familiar with kernel internals, I disagree. From a userspace perspective, one wants to know under what circumstances that condition in the kernel code might be true. @Potatoswatter: I think all of what I'm asking for is well within the scope of the original question, "what race condition does it talk about?" I think if I opened a new question it would be justified for close-voters to mark it as a duplicate of this one. :-pGateshead
@R..: You mean, change the note to something like "A careful programmer will first dup() the target descriptor, then use dup2()/dup3() to replace the target descriptor atomically, and finally close the initially duplicated target descriptor. This replaces the target descriptor atomically, but also retains a duplicate for closing so that close-time errors may be checked for. (In Linux, close() should only be called once, as the referred to descriptor is always closed, even in case of errno==EINTR.)"?Disbar
@NominalAnimal: Checking for close-time errors is of dubious utility, but thanks for the robust alternative. I'll pass it along.Gateshead
@R..: Looking at the kernel code, dup2() returning EBUSY occurs only when the descriptor is marked used, but the corresponding struct file * (fdt->fd[fd]) has not been filled in yet. In generic filesystem code, that only occurs with open()/creat()/openat() etc., so the man page stating EBUSY occurring (only) when dup() races against open() seems correct. However, a filesystem could, in theory, have an operation that removes the struct file * (fdt->fd[fd]) temporarily. One would need to check each filesystem to verify. Assuming my understanding is correct, of course.Disbar
@R..: If you are using NFS or other remote filesystems, checking for close() is definitely not of dubious utility. A careful programmer will always check close() return value -- that's what the dup() man page note refers to, after all. For example, fsync() and fdatasync() on NFS4 cause a corresponding call on the file server too. With battery-packed RAID controllers, that is not always necessary. close() only verifies all data has been successfully sent to the server, and often suffices.Disbar
So it's a race against open, etc. occurring in another thread (or process sharing CLONE_FILES) where the target descriptor was not already open before dup2/dup3 was called? It sounds to me like it can only happen if these functions are being used in a way that has a dangerous inherent race already, in which case programs using them correctly don't need to worry about EBUSY.Gateshead
@R.. I think your second comment has the valid explanation. Yes, dup2 is atomic, but a race condition can arise with other threads. According to the manpage, dup2() can close an already open fd x and then dup, while open() can open that particular fd x. That is a conflict. If the Linux system guarantees atomicity, then the error EBUSY is part of that guarantee. Throwing an error in this case would be well-defined behaviour. Otherwise the conflict can cause undefined behaviour.Coverley
So does anyone want to write up a good detailed summary for the 500 bounty?Gateshead
A
8

I'm not entirely aware of the choices Linux made, but the comment from the Linux kernel in the other answer points to stuff I worked on in OpenBSD 13 years ago, so here is my attempt at remembering what the hell was going on.

Because of the way open is implemented, it first allocates a file descriptor, then it actually tries to finish the opening operation with the file descriptor table unlocked. One reason might be that we don't actually want to cause the side effects of open (the simplest would be changing atime on the file, but, for example, opening devices can have much more severe side effects) if it fails because we're out of file descriptors. The same applies to all other operations that allocate file descriptors; when you read the text below, just substitute open with "any system call that allocates file descriptors". I don't remember if this is mandated by POSIX or just The Way Things Have Always Been Done.

open can allocate memory, go down to the file system and do a bunch of things that are potentially blocking for a long time. In the worst case for filesystems like fuse it might even go back up to userland. For that reason (and others) we don't actually want to hold the file descriptor table locked during the whole open operation. Locks inside the kernel are quite bad to hold while sleeping, doubly so if the completion of the locked operation might require interaction with userland[1].

The problem happens when someone calls open in one thread (or a process that shares the same file descriptor table), it allocates a file descriptor and hasn't finished it yet while at the same time another thread does a dup2 pointing to the same file descriptor that open just got. Since an unfinished file descriptor is still invalid (for example read and write will return EBADF when you try to use it) we can't actually close it just yet.

In OpenBSD this is solved by keeping track of allocated, but not yet open, file descriptors with complex reference counting. Most operations will just pretend that the file descriptor isn't there (but it isn't allocatable either) and will just return EBADF. But for dup2 we can't pretend it isn't there, because it is. The end result is that if two threads concurrently call open and dup2, open will actually perform a full open operation on the file, but since dup2 won the race for the file descriptor, the last thing open does is decrement the reference count on the file it just allocated and close it again. Meanwhile dup2 won the race and pretended to close the file descriptor that open got (which it didn't actually do; it was open that did it). It doesn't really matter which behavior the kernel chooses, since in both cases this is a race that will lead to unexpected behavior for either open or dup2. At best, Linux returning EBUSY is just shrinking the window for a race, but the race is still there: there's nothing preventing the dup2 call from happening just as open is returning in the other thread and replacing the file descriptor before the caller of open has a chance to use it.

The error in your question will most likely happen when you hit this race. To avoid it do not dup2 to a file descriptor you don't know the state of unless you are sure that there is no one else that will be accessing the file descriptor table at the same time. And the only way to be sure is to be the only thread running (file descriptors are opened behind your back by libraries all the time) or knowing exactly what file descriptor you're overwriting. The reason dup2 over an unallocated file descriptor is allowed in the first place is that it's a common idiom to close fds 0, 1 and 2 and dup2 /dev/null into them.
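As a rough illustration of the "know exactly what file descriptor you're overwriting" advice, the sketch below only replaces a descriptor that is currently open and does not loop on failure. replace_open_fd is a hypothetical helper name; the fcntl() check is a sanity check rather than a guarantee, since another thread could still close the target between the check and the dup2.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: point the already-open descriptor `target`
 * at `path`. Returns 0 on success, -1 with errno set on failure. */
static int replace_open_fd(int target, const char *path, int flags)
{
    /* Refuse targets that are not open right now: dup2() onto a free
     * slot is the usage that can race with open() in another thread. */
    if (fcntl(target, F_GETFD) == -1)
        return -1;                /* errno is EBADF */

    int fd = open(path, flags);
    if (fd == -1)
        return -1;

    /* No retry loop: any failure is handed back to the caller. */
    if (dup2(fd, target) == -1) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* The temporary descriptor is no longer needed. */
    if (fd != target)
        close(fd);
    return 0;
}
```

For the /dev/null idiom mentioned above, this would be called as replace_open_fd(0, "/dev/null", O_RDWR) and likewise for 1 and 2, instead of closing them first.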

On the other hand, not closing file descriptors before dup2 will lose the error return from close. I wouldn't worry about that though, since the errors from close are stupid and shouldn't be there in the first place: Handling C Read Only File Close Errors. For another example of unexpected thread behavior, and of how file descriptors behave strangely because of what I've been talking about here, see this question: Socket descriptor not getting released on doing 'close ()' for a multi-threaded UDP client

Here's some example code to trigger this:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <err.h>
#include <pthread.h>

static void *
do_bad_things(void *v)
{
    int *ip = v;
    int fd;

    sleep(2);   /* pretend this is proper synchronization. */

    if ((fd = open("/dev/null", O_RDONLY)) == -1)
        err(1, "open 2");

    /* dup2 returns the new fd on success, so test for -1 explicitly. */
    if (dup2(fd, *ip) == -1)
        warn("dup2");

    return NULL;
}

int
main(int argc, char **argv)
{
    pthread_t t;
    int fd;

    /* This will be our next fd. */
    if ((fd = open("/dev/null", O_RDONLY)) == -1)
        err(1, "open");
    close(fd);

    if (mkfifo("xxx", 0644))
        err(1, "mkfifo");

    if (pthread_create(&t, NULL, do_bad_things, &fd))
        err(1, "pthread_create");

    if (open("xxx", O_RDONLY) == -1)
        err(1, "open fifo");

    return 0;
}

A FIFO is the standard method to cause open to block for as long as you wish. As expected, this works silently on OpenBSD, and on Linux dup2 returns EBUSY. On MacOS, for some reason, it kills the shell where I did "echo foo > xxx", while a normal program that just opens it for writing works fine; I have no idea why.

[1] An anecdote here. I've been involved in writing a fuse-like filesystem used for an AFS implementation. One bug we had was that we held a file object lock while calling into userland. The locking protocol for directory entry lookups requires you to hold the directory lock, then look up the directory entry, lock the object under that directory entry and then release the directory lock. Since we held the file object lock, some other process came in and tried to look up the file, which led to that process sleeping on the file lock while still holding the directory lock. Another process came in, tried to look up the directory, and ended up holding the lock of the parent directory. Long story short, we ended up with a chain of locks held until we reached the root directory. Meanwhile the filesystem daemon was still talking to the server over the network. For some reason the network operation failed and the filesystem daemon needed to log an error message. To do that it had to read some locale database. And to do that it needed to open a file using the full path. But since the root directory was locked by someone else, the daemon waited for that lock. And we had a deadlock chain 8 locks long. That's why the kernel often performs complex contortionist gymnastics to avoid holding locks during long operations, especially filesystem operations.

Angelo answered 3/6, 2014 at 9:48 Comment(8)
Thank you for the detailed answer. I think this addresses my main question of whether EBUSY can happen under correct usage of dup2: where either the destination file descriptor is known to be open prior to the dup2 call and not going to be closed during it, or there is a guarantee that no other thread will allocate a fd during the dup2 call, which can basically only be guaranteed when the program is single-threaded or extremely simple. This answer is definitely worthy of the bounty, so +500. :-)Gateshead
Added some example code and updated some of the text now that I had a chance to sleep on it.Angelo
Your example makes this answer even better, because it (1) demonstrates a case where looping on EBUSY leads to deadlock, and (2) almost violates XSH 2.9.7 Thread Interactions with Regular File Operations except for the fact that a FIFO is not a regular file. If Linux can cause this EBUSY to happen when opening a regular file (with a much harder-to-hit race) then I think this issue is actually a conformance problem.Gateshead
You can definitely hit this race for regular files. If you want to experiment with it, the simplest way to do it without actually racing is to open a regular file on an NFS mount where the NFS server is down or has a huge network latency. Linux already violates POSIX here because as far as I understand, you can't return random errors and EBUSY is not specified as a valid error for dup2 in POSIX. On the other hand, this probably means that POSIX will get updated.Angelo
Speaking of updating POSIX and dup2, take a look at what it says about FD_CLOEXEC. I'm sure it wasn't there a decade ago. For an educated guess about the history of that take a look at this: openbsd.org/cgi-bin/cvsweb/~checkout~/src/regress/sys/kern/… Neither the clearing of FD_CLOEXEC nor the not clearing of FD_CLOEXEC used to be documented. POSIX was just updated to match the same implementation mistake that everyone made. Now Linux has dup3, which makes the behavior of this mistake explicit.Angelo
Unless an interface defines no errors at all, POSIX allows an implementation to return additional errors not defined in the standard as long as none of the standard error codes for that interface are reused for nonstandard error conditions. IMO this is something of a flaw in POSIX, since there are many instances where a portable application has no way to respond to such nonstandard errors except to abort (e.g. what do you do if pthread_mutex_unlock returns an error when it shouldn't?). But I don't think the extra error EBUSY is necessarily nonconforming in itself.Gateshead
What seems like more of a problem to me is that the open and dup2 operations are not atomic with respect to each other, although one could argue that, with the error considered as an abstract failure not necessarily related to the open, it's true that each of open and dup2 see all the side effects of the other, or none of the side effects.Gateshead
Both FreeBSD and NetBSD block in the dup2 system call; I can't see any signs of "amusing" deadlocks, perhaps they were fixed. This seems kind of reasonable given the unreasonableness of the situation, but an amendment to the man pages might be in order.Ard
