Please check the NameNode HA architecture and the key entities involved in handling HDFS client requests.
Where does this request go first? I mean, how would the client know which NameNode is active?
For the client/driver it doesn't matter which NameNode is active, because we address HDFS by nameservice ID rather than by the hostname of a NameNode. The nameservice ID is only a logical name: the HDFS client looks up the NameNodes configured for it and sends its requests to whichever one is currently active.
Example: hdfs://nameservice_id/rest/of/the/hdfs/path
Explanation:
How does hdfs://nameservice_id/ work, and which configuration properties are involved?
In the hdfs-site.xml file:
Create a nameservice by giving it an ID (here the nameservice_id is mycluster):
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
<description>Logical name for this new nameservice</description>
</property>
Next, list the NameNode IDs that belong to the nameservice, using dfs.ha.namenodes.[$nameservice ID]:
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
<description>Unique identifiers for each NameNode in the nameservice</description>
</property>
Then map each NameNode ID to its host, using dfs.namenode.rpc-address.[$nameservice ID].[$name node ID]:
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
After that, specify the Java class that the DFS client uses to determine which NameNode is currently active and serving client requests:
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
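As a rough illustration of what happens on the client side: ConfiguredFailoverProxyProvider works from the NameNodes listed for the nameservice, and when a call fails or lands on the standby, the client retries against the other one. That retry behaviour can be tuned with client-side properties in hdfs-site.xml. The values below are what I believe the stock defaults to be, so treat them as assumptions and verify against the hdfs-default.xml of your Hadoop version.

```xml
<!-- Optional client-side failover tuning (values are assumed defaults, verify for your version) -->
<property>
  <name>dfs.client.failover.max.attempts</name>
  <value>15</value>
  <description>How many failover attempts the client makes before giving up</description>
</property>
<property>
  <name>dfs.client.failover.sleep.base.millis</name>
  <value>500</value>
  <description>Base wait time between failover attempts; it grows with each retry</description>
</property>
<property>
  <name>dfs.client.failover.sleep.max.millis</name>
  <value>15000</value>
  <description>Upper bound on the wait time between failover attempts</description>
</property>
```

In most setups these can be left out entirely; they matter mainly when clients time out too early or too late during a failover.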
Finally, after these configuration changes the HDFS URL will look like this:
hdfs://mycluster/<file_location_in_hdfs>
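If most of your paths live on this cluster, the nameservice can also be made the default file system, so that clients may omit the hdfs://mycluster prefix. A minimal sketch, assuming the same mycluster nameservice ID as above; note that this property goes in core-site.xml, not hdfs-site.xml:

```xml
<!-- core-site.xml: make the logical nameservice the default file system -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
  <description>Paths without a scheme are resolved against the HA nameservice</description>
</property>
```

With this in place, a path such as /rest/of/the/hdfs/path is resolved against hdfs://mycluster, and the client still finds the active NameNode through the failover proxy provider.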
To answer your question I have covered only a few configuration properties. Please check the detailed documentation for how the NameNodes, JournalNodes and ZooKeeper machines together form NameNode HA in HDFS.