How does the Hadoop NameNode failover process work? [closed]

According to the Hadoop docs, which you can find here, automatic failover requires adding two components to an HDFS deployment:

1: a ZooKeeper quorum

2: the ZKFailoverController process.

The docs describe failure detection as follows:

Each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should be triggered.
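To make the session-expiry idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the ZooKeeper client API: `ToySessionTracker`, the logical clock passed as `now`, the timeout of 5 ticks, and the names `nn1`/`nn2` are all assumptions made up for illustration.

```python
class ToySessionTracker:
    """In-memory stand-in for ZooKeeper session tracking (hypothetical, not the real API)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = {}  # session owner -> logical time of its last heartbeat
        self.watchers = []        # callbacks invoked with the name of an expired owner

    def heartbeat(self, owner, now):
        # A live client keeps its session open by checking in regularly.
        self.last_heartbeat[owner] = now

    def watch(self, callback):
        self.watchers.append(callback)

    def tick(self, now):
        """Expire every session whose last heartbeat is older than the timeout."""
        expired = [o for o, t in self.last_heartbeat.items() if now - t > self.timeout]
        for owner in expired:
            del self.last_heartbeat[owner]
            for callback in self.watchers:
                callback(owner)  # this is the "notification" that a failover is needed
        return expired


tracker = ToySessionTracker(timeout=5)
events = []
tracker.watch(events.append)

tracker.heartbeat("nn1", now=0)
tracker.heartbeat("nn2", now=0)
tracker.heartbeat("nn2", now=4)  # nn2 stays alive; nn1's machine has crashed
tracker.tick(now=6)              # nn1's last heartbeat is now 6 ticks old
print(events)                    # ['nn1']
```

The point of the sketch is that nobody has to report the crash: the silence itself, measured against a timeout, is what triggers the notification.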

So to answer your questions:

Q: How come a namenode can run something to detect its own failure?

A: The NameNode does not have to detect its own failure. A separate ZKFailoverController (ZKFC) process runs on the same machine and maintains a session in ZooKeeper on the NameNode's behalf. When this session expires, the other NameNode is notified that a failover should be triggered.

The ZKFC health monitor also periodically pings its local NameNode (this is your heartbeat). If the NameNode crashes or stops responding, the health monitor marks it as unhealthy.

When the local NameNode is healthy and is the active NameNode, its ZKFC holds a special "lock" znode in ZooKeeper. When the NameNode is marked as unhealthy, this lock is deleted. When the ZKFC of another NameNode sees that no node currently holds the lock znode, it tries to acquire the lock. If it succeeds, its local NameNode becomes the active NameNode.
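The lock handover described above can be sketched in a few lines of Python. Again this is only a toy model under stated assumptions: `ToyLockZnode` and `ToyZKFC` are hypothetical classes, the `healthy` flag stands in for the result of the health-monitor ping, and the sketch leaves out ZooKeeper sessions, watches, and fencing of the previous active node.

```python
class ToyLockZnode:
    """In-memory stand-in for the lock znode (hypothetical, not the ZooKeeper API)."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, name):
        # Only succeeds if nobody holds the lock, or the caller already holds it.
        if self.holder is None:
            self.holder = name
        return self.holder == name

    def release(self, name):
        if self.holder == name:
            self.holder = None


class ToyZKFC:
    """Hypothetical failover controller paired with one local NameNode."""

    def __init__(self, name, lock):
        self.name = name
        self.lock = lock
        self.healthy = True    # stands in for the result of the last health check
        self.state = "standby"

    def step(self):
        if not self.healthy:
            # Local NameNode failed its health check: give up the lock.
            self.lock.release(self.name)
            self.state = "unhealthy"
        elif self.lock.try_acquire(self.name):
            self.state = "active"
        else:
            self.state = "standby"


lock = ToyLockZnode()
nn1, nn2 = ToyZKFC("nn1", lock), ToyZKFC("nn2", lock)

nn1.step(); nn2.step()
print(nn1.state, nn2.state)  # active standby

nn1.healthy = False          # nn1's NameNode crashes
nn1.step(); nn2.step()
print(nn1.state, nn2.state)  # unhealthy active
```

Whoever grabs the lock first is active; everyone else stays in standby until the lock disappears.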

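Q: Who sends the heartbeat to whom? How does it detect NameNode failure?

A: Again, the ZooKeeper session. The ZKFC pings its local NameNode, and a failed health check or an expired ZooKeeper session is what signals the failure.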

Q: Where does this process run?

A: The ZKFC runs on each NameNode machine. ZooKeeper itself can be installed on a single machine or on a cluster. You can read the docs here.

Q: Whom does it notify about the transition?

A: This is all handled by the ZKFailoverController process running on each NameNode machine.

There's another good article here, which visualises this a bit better than my words.
