What are the differences between ZooKeeper, JournalNodes, and the Quorum Journal Manager in Hadoop?

After studying material on multiple websites and in videos, I am confused about the functionality and purpose of three Hadoop components: ZooKeeper, the JournalNode, and the Quorum Journal Manager.

Could anyone please explain why each of these was introduced, and how their purposes and functionality differ?

Thanks in advance.

Oddity answered 25/9, 2014 at 12:13 Comment(1)

Think of it like this: ZooKeeper is a group of people, each assigned to watch over a factory and coordinate it; a JournalNode is a place where all the factory managers can check each other's status and coordinate. QJM is a combination of both, used in HA for better coordination in case of failover.

ZooKeeper coordinates HBase RegionServers and other Hadoop modules that require ZooKeeper.

JournalNodes keep the active and standby NameNodes in sync by storing the shared edit log.

QJM is the NameNode-side component that writes the edit log to a majority of the JournalNodes.

In a core Hadoop setup, JournalNodes are needed only when HDFS high availability is configured with QJM.

Wainscoting answered 25/9, 2014 at 18:3 Comment(6)
Thank you, Antariksha Yelkawar, for your answer. Could you please state exactly what the tasks of each component are in a Hadoop system? That would be clearer than what I got from the answer above. - Oddity
What exactly do you mean by 'coordination'? Could you please explain the functionality and workflow, or in what situations these components take control? I don't understand what sort of coordination they perform. - Oddity
By coordination I mean, for example, that when a MapReduce job runs, its work has to be divided between nodes, and some component must coordinate that (the JobTracker, or the ResourceManager under YARN). It would be better to read the documentation for ZooKeeper and the JournalNode. - Wainscoting
I searched multiple documents and links, but the functionality of each component and the point at which it is invoked are not clearly differentiated, which confused me further. So I am waiting for a clear, in-depth explanation. - Oddity
Hello, I have set up Hadoop 2.4.1, but no JournalNode process is displayed. Was this feature added after this version? - Barbaraanne
You must be using YARN, which uses the ResourceManager and NodeManager instead of the JobTracker. JournalNode processes appear only when HDFS HA is configured. - Wainscoting

First, quorum means that decisions need a majority. So when you see the word "quorum", you should think of a clustered, that is, multi-host, configuration. You will hear this term for both ZooKeeper and JournalNodes.

A short description of what each one does will help you distinguish their purposes.

ZooKeeper: ZooKeeper is a central synchronisation service for information that applications need to check frequently, such as naming, configuration, and similar data. The most common case is application configuration. When you change a setting that affects, say, 80 servers, you need a synchronisation service to propagate the change to all nodes. The application itself may have this feature. But imagine you add another 12 applications to your environment: you would have to take care of each application's synchronisation service one by one. This is where ZooKeeper comes in. ZooKeeper can manage all of this information by itself. If you set it up as a cluster (use an odd number of hosts, because decisions need a majority and an even number tolerates no more failures than the odd number below it), you get high availability for ZooKeeper (failover cases) and have a ZooKeeper quorum.
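
The odd-host rule comes down to majority arithmetic. Here is a minimal sketch; the helper names `quorum_size` and `tolerated_failures` are made up for illustration and are not part of ZooKeeper or Hadoop:

```python
def quorum_size(hosts: int) -> int:
    """Smallest number of hosts that forms a strict majority."""
    return hosts // 2 + 1


def tolerated_failures(hosts: int) -> int:
    """How many hosts can fail while a majority is still reachable."""
    return hosts - quorum_size(hosts)


if __name__ == "__main__":
    # Columns: hosts, majority needed, failures tolerated
    for hosts in range(1, 8):
        print(hosts, quorum_size(hosts), tolerated_failures(hosts))
```

Going from 3 hosts to 4 raises the majority from 2 to 3 but still tolerates only one failure, so the extra host buys nothing.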

JournalNode: In a high-availability Hadoop cluster you have more than one NameNode, running in active/standby mode. The active NameNode informs the JournalNodes of changes. The standby NameNode asks the JournalNodes what has changed. As with ZooKeeper, if you set the JournalNodes up as a cluster (again with an odd number of hosts, for the same majority reason), you get high availability for the JournalNode function as well, and the NameNodes use the Quorum Journal Manager to write to a majority of them.

Actually, I have never heard of them being set up on a single host or node, except for lab purposes (a VM on a PC).
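
To make the write and read paths concrete, here is a toy in-memory model. It is only a sketch under simplifying assumptions: the names `JournalNode`, `log_edit`, and `read_edits` are hypothetical, and the real implementation also deals with log segments, recovery, and uncommitted edits, which are ignored here.

```python
class JournalNode:
    """Toy stand-in for a JournalNode: stores edits in memory."""

    def __init__(self):
        self.up = True
        self.edits = {}  # transaction id -> edit

    def write(self, txid, edit):
        """Store the edit and acknowledge it, unless this node is down."""
        if not self.up:
            return False
        self.edits[txid] = edit
        return True


def log_edit(journal_nodes, txid, edit):
    """Active side: the edit counts as logged only if a majority acknowledges it."""
    acks = sum(jn.write(txid, edit) for jn in journal_nodes)
    return acks >= len(journal_nodes) // 2 + 1


def read_edits(journal_nodes, after_txid):
    """Standby side: collect edits newer than after_txid from reachable nodes."""
    seen = {}
    for jn in journal_nodes:
        if jn.up:
            for txid, edit in jn.edits.items():
                if txid > after_txid:
                    seen[txid] = edit
    return [seen[txid] for txid in sorted(seen)]


if __name__ == "__main__":
    jns = [JournalNode() for _ in range(3)]
    jns[2].up = False                       # one of three nodes is down
    print(log_edit(jns, 1, "mkdir /data"))  # two acknowledgements out of three
    print(read_edits(jns, 0))
```

With one of three nodes down the write still reaches a majority, and the standby can still read the edit from the remaining nodes.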

Scrounge answered 10/2, 2017 at 14:44 Comment(0)

1. ZooKeeper

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.

Role of ZooKeeper in the Hadoop ecosystem:

During the Hadoop NameNode failover process, ZooKeeper is used to avoid a split-brain scenario, so that the NameNode state does not diverge because of the failover.

Refer to this post for more details:

How does Hadoop Namenode failover process works?
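
As a rough illustration of the idea, automatic failover relies on an exclusive lock in ZooKeeper that is tied to the holder's session. The class below is a toy in-memory stand-in, not a real ZooKeeper client, and the names `ToyZooKeeper`, `try_acquire`, and `session_expired` are hypothetical:

```python
class ToyZooKeeper:
    """In-memory stand-in for a single ephemeral lock node."""

    def __init__(self):
        self.lock_holder = None

    def try_acquire(self, session):
        """Take the lock if it is free; return True if `session` holds it."""
        if self.lock_holder is None:
            self.lock_holder = session
        return self.lock_holder == session

    def session_expired(self, session):
        """An ephemeral node disappears when its owner's session ends."""
        if self.lock_holder == session:
            self.lock_holder = None


if __name__ == "__main__":
    zk = ToyZooKeeper()
    print(zk.try_acquire("nn1"))  # nn1 takes the lock and becomes active
    print(zk.try_acquire("nn2"))  # nn2 cannot, so it stays standby
    zk.session_expired("nn1")     # nn1 crashes and its session expires
    print(zk.try_acquire("nn2"))  # nn2 can now take over
```

Because only one session can hold the lock at a time, two NameNodes cannot both conclude that they are active.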

2. JournalNode (used in the NameNode failover process)

In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs).

JournalNode machines - the machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager.

Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine.

3. Quorum Journal Manager (QJM)

QJM allows the Active and Standby NameNodes to share edit logs.

Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario.
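
One way to picture the single-writer guarantee is with epoch numbers: each new writer announces a higher epoch, and the JournalNodes refuse writes from anything older. The sketch below is a toy model of that idea as I understand the design, not the actual Hadoop code; `FencedJournalNode` and its methods are hypothetical names.

```python
class FencedJournalNode:
    """Toy JournalNode that accepts writes only from the newest writer."""

    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch):
        """Accept a writer's epoch only if it is higher than any seen so far."""
        if epoch <= self.promised_epoch:
            return False
        self.promised_epoch = epoch
        return True

    def write(self, epoch, edit):
        """Accept writes only from the writer holding the latest promised epoch."""
        if epoch != self.promised_epoch:
            return False
        self.edits.append(edit)
        return True


if __name__ == "__main__":
    jn = FencedJournalNode()
    jn.new_epoch(1)                # first active NameNode
    print(jn.write(1, "edit-a"))   # accepted
    jn.new_epoch(2)                # failover: a new active NameNode takes over
    print(jn.write(1, "edit-b"))   # old writer is rejected
    print(jn.write(2, "edit-c"))   # new writer is accepted
    print(jn.edits)
```

A NameNode that was active before the failover and has not noticed it yet cannot slip edits into the log, which is what rules out split-brain corruption.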

Incapable answered 11/2, 2017 at 17:45 Comment(1)
Does the NN depend on the JN daemons being started first? - Literati
