zookeeper lock stayed locked
I am using Celery and ZooKeeper (the kazoo lock recipe) to lock my workers. My problem: when I kill (-9) one of the workers before it releases the lock, that lock stays locked forever.

So my question is: does killing the process release the locks held by that process, or is this a bug in ZooKeeper?

Balkanize answered 21/12, 2012 at 21:14 Comment(3)
What caught my eye was not the actual question, but the extremely comedic title. - Separatist
@inspectorG4dget: And the question does not fail to deliver, e.g., "When I kill one of the workers". - Superorganic
@Milan Kocic I am facing the same issue: if the thread holding the lock is killed, the other threads keep waiting. In my case the ephemeral nodes are only deleted when I shut down Apache Tomcat. Right now I have a workaround in which the other thread checks the timestamp on the parent persistent node and releases the lock if it has not been updated within the last 2*x time. But I don't think this is an elegant way. Did you find out why it was happening, and how did you solve the issue? - Ballon

Killing a process with a kill signal will do nothing to clear "software locks" such as ZooKeeper locks.

The only kind of locks cleared by a KILL signal are OS-level locks: all of the process's file descriptors are closed, so file descriptor locks go away with them. ZooKeeper locks are not OS-level locks, if only because the ZooKeeper process, even on the same machine, is not your Python process.

It is therefore not a bug in ZooKeeper, but the expected behavior of your kill -9.

Fanchan answered 21/12, 2012 at 21:17 Comment(8)
Thank you, this confirms my suspicions. - Balkanize
Can you please tell me what happens in this case: I have held a ZooKeeper lock for a long time and my listener gets a connection suspended or lost event. If the connection comes back after some time, should I keep the task (is the current lock still OK?), or should I repeat the task (create a new lock on the client side and acquire it again)? In other words, should I wait for the connected signal, or kill and repeat the current task if I get a suspended or lost signal while the lock is held? - Balkanize
I don't know, to be honest; the ZooKeeper documentation might tell you more here. - Fanchan
kill -9 should cause the ZooKeeper session to die, which should cause the ephemeral node to die, which should cause the ZooKeeper lock to be released. - Unbuild
@sbridges: no, kill -9 will not kill TCP connections, and a ZooKeeper session is linked to a connection. As such, the ephemeral node still exists if you kill -9. - Fanchan
If you kill the process, won't that kill all TCP connections associated with that process? Even if it didn't, the process has to do work to send heartbeats, and the heartbeats won't be sent, so the ephemeral node will die. - Unbuild
No, it won't kill TCP connections, i.e., it will not send an RST to the other end. Yes, heartbeats won't be sent, but how frequently are these heartbeats sent? How is the heartbeat timeout determined? - Fanchan
It won't kill the TCP connection on the server; it will kill it on the client. The session timeout should be on the order of seconds; 5-30 seconds is a normal range. - Unbuild

ZooKeeper locks use ephemeral nodes. An ephemeral node lives only as long as the session that created it. A session is kept alive by the client process periodically sending heartbeat messages to ZooKeeper.

So if you kill the process that created the lock, the lock will eventually be released: ZooKeeper stops receiving heartbeats, the session expires, and its ephemeral nodes are deleted.

So killing a worker before the lock is released should eventually release the lock.
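
To make the timing concrete, here is a toy in-memory model of that behaviour. It is not ZooKeeper or kazoo code: `ToySessionServer`, its method names, and the injectable `clock` are all made up for illustration, and a real server tracks sessions in a more involved way. The sketch only shows the one rule that matters here: an ephemeral node survives until the session timeout has elapsed since the last heartbeat.

```python
import time


class ToySessionServer:
    """Toy stand-in for server-side session tracking (hypothetical, not ZooKeeper)."""

    def __init__(self, session_timeout, clock=time.monotonic):
        self.session_timeout = session_timeout
        self.clock = clock
        self.last_heartbeat = {}   # session id -> time of last heartbeat
        self.ephemeral_owner = {}  # node path -> owning session id

    def heartbeat(self, session_id):
        # A live client calls this periodically; a killed client never does again.
        self.last_heartbeat[session_id] = self.clock()

    def create_ephemeral(self, session_id, path):
        self.heartbeat(session_id)
        self.ephemeral_owner[path] = session_id

    def expire_stale_sessions(self):
        now = self.clock()
        expired = {
            sid for sid, seen in self.last_heartbeat.items()
            if now - seen > self.session_timeout
        }
        for sid in expired:
            del self.last_heartbeat[sid]
        # Ephemeral nodes disappear together with the session that owned them.
        self.ephemeral_owner = {
            path: sid for path, sid in self.ephemeral_owner.items()
            if sid not in expired
        }

    def exists(self, path):
        self.expire_stale_sessions()
        return path in self.ephemeral_owner
```

With a fake clock, a worker that takes the lock at t=0 and is then killed still appears to hold it at t=5 with a 10-second timeout (`exists` returns `True`), and the node is gone once more than 10 seconds have passed without a heartbeat.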

If the lock is never released, a few things could be happening:

  1. Someone else noticed the lock was released and obtained it. Presumably you are locking because there is contention, so some other process will try to acquire the lock as soon as it is released.
  2. You aren't waiting long enough. When you connect to ZooKeeper you set a session timeout parameter; that is how long the server keeps the session alive without hearing any heartbeats. You have to wait at least this long to see the locks released.
  3. There is a bug in kazoo. This is possible, but it looks like the kazoo lock recipe uses ephemeral nodes, and the use case you describe is a very basic one.
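
For point 2, a minimal sketch of where the session timeout enters on the kazoo side. The host string, the 10-second value, and the helper names `make_client` and `run_locked` are assumptions for illustration. As far as I know, kazoo passes the `timeout` argument of `KazooClient` to the server as the requested session timeout, and the server may adjust it to fit its own configured minimum and maximum, so treat the number as a request rather than a guarantee.

```python
def make_client(hosts="127.0.0.1:2181", session_timeout=10.0):
    """Build (but do not start) a kazoo client; hosts and timeout are example values."""
    # Imported lazily so this module still loads where kazoo is not installed.
    from kazoo.client import KazooClient

    # `timeout` is in seconds and is used as the requested session timeout.
    return KazooClient(hosts=hosts, timeout=session_timeout)


def run_locked(client, lock_path, identifier, work):
    """Run work() while holding the kazoo lock at lock_path.

    `client` must already be started. The `with` block releases the lock on
    normal exit or on an exception, but not after kill -9: in that case the
    lock node only goes away when the session expires.
    """
    lock = client.Lock(lock_path, identifier)
    with lock:  # blocks until the lock is acquired
        return work()
```

Typical use would be `client = make_client()`, then `client.start()`, then `run_locked(client, "/locks/job", "worker-1", do_work)`, and finally `client.stop()`, where the lock path, identifier, and `do_work` are again placeholders.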

It is very unlikely that this is a ZooKeeper bug.

How do you know the lock is not being released?
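
One way to check, sketched under the assumption that you have a started kazoo client and know the lock path you passed to `client.Lock` (the function name `describe_lock` is made up). It lists the nodes under the lock path together with the session id that owns each one; an owner of 0 means the node is not ephemeral.

```python
def describe_lock(client, lock_path):
    """Return [(child_name, owner_session_id), ...] for nodes under lock_path."""
    if client.exists(lock_path) is None:
        return []  # nobody has ever contended for this lock
    rows = []
    for child in sorted(client.get_children(lock_path)):
        # exists() returns the node's stat, or None if the node vanished
        # between the listing and this call.
        stat = client.exists(lock_path + "/" + child)
        if stat is None:
            continue
        rows.append((child, stat.ephemeralOwner))
    return rows
```

If you run this after killing the only worker and waiting out the session timeout, and a node owned by that worker's session is still listed, that would point at something other than plain contention.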

Unbuild answered 22/12, 2012 at 14:45 Comment(4)
I know the lock is not released because all the other workers wait on acquiring it, and I know that at least one worker was killed before it released the lock. Why was the worker killed, you may ask: I do this because I got a lost connection signal while the lock was held, and I don't know how the lock behaves then. Does it stay locked on the server for the other workers (that is what I want)? The operations between acquire and release take a long time, so if I get a suspended or lost connection signal while the lock is held, is it better to wait for the connected signal, or should I stop the task and repeat it? - Balkanize
Can you reproduce with just one worker? Start it, have it acquire the lock, kill -9 it, then use the ZooKeeper command line to see what nodes are in ZooKeeper. You shouldn't have any ephemeral nodes left once you have killed the worker and the session timeout has passed. - Unbuild
I'll try, but I cannot do that now, I have a demo tomorrow :) The best situation for me would be if acquire and release waited for the connected signal and only then continued to the next operation. That way I would know the critical section is always safe. - Balkanize
Everything is fine with ZooKeeper. I had some cases where it took a long time (about 10 hours) to release the lock, but that was because some worker didn't properly release the lock, so it stayed locked until the maximal timeout expired and the lock could be released. - Balkanize
