I'm not entirely aware of the choices Linux made, but the comment from the Linux kernel in the other answer points to stuff I worked on in OpenBSD 13 years ago, so here's my attempt at remembering what the hell was going on.
Because of the way open is implemented, it first allocates a file descriptor and then tries to finish the opening operation with the file descriptor table unlocked. One reason might be that we don't want to cause the side effects of open (the simplest would be changing the atime on the file, but opening devices, for example, can have much more severe side effects) if it fails because we're out of file descriptors. The same applies to all other operations that allocate file descriptors; when you read the text below, just substitute open with "any system call that allocates file descriptors". I don't remember if this is mandated by POSIX or if it's just The Way Things Have Always Been Done.
open can allocate memory, go down to the file system and do a bunch of things that are potentially blocking for a long time. In the worst case, for filesystems like fuse, it might even go back up to userland. For that reason (and others) we don't actually want to hold the file descriptor table locked during the whole open operation. Locks inside the kernel are quite bad to hold while sleeping, doubly so if the completion of the locked operation might require interaction with userland[1].
The problem happens when someone calls open in one thread (or in a process that shares the same file descriptor table): it allocates a file descriptor and hasn't finished yet, while at the same time another thread does a dup2 pointing to the same file descriptor that open just got. Since an unfinished file descriptor is still invalid (for example, read and write will return EBADF when you try to use it), we can't actually close it just yet.
In OpenBSD this is solved by keeping track of allocated but not yet open file descriptors with complex reference counting. Most operations will just pretend the file descriptor isn't there (but it isn't allocatable either) and will return EBADF. But dup2 can't pretend it isn't there, because it is. The end result is that if two threads concurrently call open and dup2, open will actually perform a full open operation on the file, but since dup2 won the race for the file descriptor, the last thing open does is decrement the reference count on the file it just opened and close it again. Meanwhile dup2, having won the race, pretended to close the file descriptor that open got (it didn't actually do it; it was open that did it). It doesn't really matter which behavior the kernel chooses, since in both cases this is a race that will lead to unexpected behavior for either open or dup2. At best, Linux returning EBUSY just shrinks the window for the race, but the race is still there: there's nothing preventing the dup2 call from happening just as open is returning in the other thread, replacing the file descriptor before the caller of open has a chance to use it.
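Continuing the hypothetical sketch from above, this is roughly the decision dup2 faces when its target slot is reserved but not yet installed. Again, the helper names are invented and neither kernel looks literally like this.

/*
 * Hypothetical sketch, continuing the one above. fd_lookup(),
 * fd_is_reserved() and fd_install() are invented names.
 */
int
sys_dup2_sketch(struct proc *p, int oldfd, int newfd)
{
	struct file *fp;
	int error;

	if ((error = fd_lookup(p, oldfd, &fp)) != 0)
		return (error);

	if (fd_is_reserved(p, newfd)) {
		/*
		 * The slot was reserved by an open that hasn't
		 * finished yet. The Linux-style answer: give up
		 * with EBUSY. The OpenBSD-style answer: take the
		 * slot anyway and let the in-flight open notice
		 * that it lost the race and drop the file it
		 * opened.
		 */
		return (EBUSY);
	}

	/* Install fp under newfd, implicitly closing whatever was there. */
	fd_install(p, newfd, fp);
	return (0);
}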
The error in your question will most likely happen when you hit this race. To avoid it, do not dup2 to a file descriptor whose state you don't know, unless you are sure that no one else will be accessing the file descriptor table at the same time. And the only way to be sure of that is to be the only thread running (file descriptors are opened behind your back by libraries all the time) or to know exactly which file descriptor you're overwriting. The reason dup2 over an unallocated file descriptor is allowed in the first place is that it's a common idiom to close fds 0, 1 and 2 and dup2 /dev/null into them.
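For reference, that idiom looks something like this. It is safe only because we know exactly which descriptors we are overwriting and, typically, because no other threads are running yet:

#include <fcntl.h>
#include <unistd.h>

/*
 * The classic daemon-style idiom: point fds 0, 1 and 2 at
 * /dev/null. Here we know exactly which descriptors we are
 * overwriting, so the problem described above doesn't apply.
 */
void
null_stdio(void)
{
	int fd;

	if ((fd = open("/dev/null", O_RDWR)) == -1)
		return;
	dup2(fd, 0);
	dup2(fd, 1);
	dup2(fd, 2);
	if (fd > 2)
		close(fd);
}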
On the other hand, not closing file descriptors before dup2 will lose the error return from close. I wouldn't worry about that though, since the errors from close are stupid and shouldn't be there in the first place: Handling C Read Only File Close Errors. For another example of unexpected behavior of threads, and of file descriptors behaving strangely because of what I've been talking about here, see this question: Socket descriptor not getting released on doing 'close ()' for a multi-threaded UDP client
Here's some example code to trigger this:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <err.h>
#include <pthread.h>
static void *
do_bad_things(void *v)
{
	int *ip = v;
	int fd;

	sleep(2); /* pretend this is proper synchronization. */
	if ((fd = open("/dev/null", O_RDONLY)) == -1)
		err(1, "open 2");
	if (dup2(fd, *ip) == -1)
		warn("dup2");
	return NULL;
}
int
main(int argc, char **argv)
{
	pthread_t t;
	int fd;

	/* This will be our next fd. */
	if ((fd = open("/dev/null", O_RDONLY)) == -1)
		err(1, "open");
	close(fd);

	if (mkfifo("xxx", 0644))
		err(1, "mkfifo");

	if (pthread_create(&t, NULL, do_bad_things, &fd))
		err(1, "pthread_create");

	if (open("xxx", O_RDONLY) == -1)
		err(1, "open fifo");

	return 0;
}
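Compile with -pthread and run it. The main thread blocks opening the fifo; two seconds later the other thread's dup2 races with that still-unfinished open. On Linux you should see something like "dup2: Device or resource busy", while the program stays blocked until something opens "xxx" for writing (e.g. echo foo > xxx from another shell).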
A FIFO is the standard way to make open block for as long as you wish. As expected, this works silently on OpenBSD, and on Linux dup2 returns EBUSY. On MacOS, for some reason, it kills the shell where I did "echo foo > xxx", while a normal program that just opens the fifo for writing works fine; I have no idea why.
[1] An anecdote here. I was involved in writing a fuse-like filesystem used for an AFS implementation. One bug we had was that we held a file object lock while calling into userland. The locking protocol for directory entry lookups requires you to hold the directory lock, then look up the directory entry, lock the object under that directory entry, and then release the directory lock. Since we held the file object lock, some other process came in, tried to look up the file, and ended up sleeping for the file lock while still holding the directory lock. Another process came in, tried to look up the directory, and ended up holding the lock of the parent directory. Long story short, we ended up with a chain of locks held all the way up to the root directory. Meanwhile the filesystem daemon was still talking to the server over the network. For some reason the network operation failed and the filesystem daemon needed to log an error message. To do that it had to read a locale database. And to do that it needed to open a file by its full path. But since the root directory was locked by someone else, the daemon waited for that lock. And we had a deadlock chain eight locks long. That's why the kernel often performs complex contortionist gymnastics to avoid holding locks during long operations, especially filesystem operations.