Deadlock with flock, fork and terminating parent process

I have a fairly complicated Python program. Internally it has a logging system that uses an exclusive (LOCK_EX) fcntl.flock to manage global locking. Effectively, whenever a log message is dumped, the global file lock is acquired, the message is written to a file (different from the lock file), and the global file lock is released.

The program also forks itself several times (after log management is set up). Generally everything works.

If the parent process is killed (and the children stay alive), I occasionally get a deadlock. All of the remaining processes block on fcntl.flock() forever. Trying to acquire the lock externally also blocks forever. I have to kill the child processes to fix the problem.

What is baffling, though, is that lsof lock_file shows no process as holding the lock! So I cannot figure out why the kernel treats the file as locked when no process is reported as holding the lock.

Does flock have issues with forking? Is the dead parent somehow holding the lock even though it is no longer in the process table? How do I go about resolving this issue?

Retentive answered 2/2, 2012 at 3:59 Comment(2)
Alright, I switched to fcntl.lockf, which wraps fcntl locks (rather than flock). The deadlocks went away. - Retentive
I suspect this is because flock locks the file descriptor (which still existed in the child processes), while fcntl uses inode/pid to lock. What is strange, though, is that lsof does not show that the children effectively own the flock; why is this the case? - Retentive

lsof is almost certainly simply not showing flock() locks, so not seeing one tells you nothing about whether there is one.

flock() locks are inherited via fd-sharing (dup() system call, or fork-and-exec that leaves the file open) and anyone with the shared descriptor can unlock the lock, but if the lock is already held, any attempt to lock it again will block. So, yes, it's likely that the parent locked the descriptor, then died, leaving the descriptor locked. The child process then tries to lock as well and blocks because the descriptor is already locked. (The same would happen if a child process locked the file, then died.)
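The inheritance behaviour can be checked with a small experiment. The following is only a sketch, assuming Linux (or another Unix where flock() is a native lock rather than one emulated with fcntl locks) and a local temporary directory; the function names are hypothetical. A forked process takes the flock through a descriptor it shares with the process that opened the file, then exits. The lock is then probed through a fresh open() while the shared descriptor is still open, and again after it has been closed.

```python
import fcntl
import os
import tempfile


def try_flock(path):
    """Try a non-blocking exclusive flock via a fresh open(); True on success."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False
    finally:
        os.close(fd)  # closing this private descriptor drops any lock it took


def demo_flock_outlives_locker():
    """Return (held_after_locker_exit, free_after_last_close)."""
    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)

    # Opened before fork(), so both processes share one open file description.
    shared_fd = os.open(path, os.O_RDWR)

    pid = os.fork()
    if pid == 0:
        # Forked process: lock through the shared descriptor, then die
        # without unlocking.
        try:
            fcntl.flock(shared_fd, fcntl.LOCK_EX)
        finally:
            os._exit(0)

    os.waitpid(pid, 0)  # the locker is gone now

    # The surviving process still has shared_fd open, so this was not the
    # last close and the lock should still be in place.
    held_after_locker_exit = not try_flock(path)

    os.close(shared_fd)  # the last close of the shared descriptor
    free_after_last_close = try_flock(path)

    os.unlink(path)
    return held_after_locker_exit, free_after_last_close


if __name__ == "__main__":
    print(demo_flock_outlives_locker())
```

Under those assumptions I would expect both values to be True: the lock outlives the process that took it and only goes away when the last copy of the descriptor is closed. The roles are the reverse of the question's (here the forked process is the one that locks and dies), but the mechanism is the same.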

Since fcntl() locks are per-process, a dying process releases all of its locks, so the surviving processes can proceed, which is what you want here.
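For comparison, here is a similar sketch using fcntl.lockf, which on typical Unix platforms wraps fcntl()-style record locks. Again the function names are hypothetical and the code assumes a Unix system with a local temporary directory. A forked process takes the lock and is then killed with SIGKILL; the other process probes the lock before and after.

```python
import fcntl
import os
import signal
import tempfile


def try_lockf(fd):
    """Try a non-blocking exclusive fcntl-style lock; True on success."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:  # EAGAIN or EACCES, depending on the platform
        return False
    fcntl.lockf(fd, fcntl.LOCK_UN)
    return True


def demo_lockf_released_on_death():
    """Return (held_while_locker_alive, free_after_locker_killed)."""
    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)
    fd = os.open(path, os.O_RDWR)
    ready_r, ready_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Forked process: take the lock, report readiness, wait to be killed.
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX)
            os.write(ready_w, b"x")
            signal.pause()
        finally:
            os._exit(0)

    os.close(ready_w)    # so the read below sees EOF if the locker dies early
    os.read(ready_r, 1)  # wait until the locker holds the lock

    # fcntl locks belong to the process, so the shared descriptor does not
    # make this process an owner: the attempt should fail.
    held_while_locker_alive = not try_lockf(fd)

    os.kill(pid, signal.SIGKILL)  # simulate the lock holder being killed
    os.waitpid(pid, 0)

    # The kernel dropped the dead process's locks when it exited.
    free_after_locker_killed = try_lockf(fd)

    os.close(fd)
    os.close(ready_r)
    os.unlink(path)
    return held_while_locker_alive, free_after_locker_killed


if __name__ == "__main__":
    print(demo_lockf_released_on_death())
```

Here I would also expect both values to be True. One caveat with fcntl-style locks: closing any descriptor for the file within a process releases all of that process's locks on the file, so a library that opens and closes the lock file behind your back can drop the lock unexpectedly.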

Papke answered 9/3, 2012 at 1:49 Comment(2)
Thanks @torek. You mentioned "...it's likely that the parent locked the descriptor, then died, leaving the descriptor locked." My question is: when the parent dies, isn't the fd released automatically? - Stealing
@sam: When a process dies, its open files are closed. Some locks will be released for this case, and some won't. The flock() ones won't unless this is the last close of the file. The fcntl(fd, F_SETLK, ...) ones will, even if this is not the last close. "Last" here is determined by the file-descriptor sharing via dup (there's a second kind of "last close" determined by the file itself, used mainly for device close and vnode recycling in vnode-based OSes). Check the documentation for the specific kind of lock you are using. - Papke
