How to access HDFS with an H/A nameservice URI from Spark running outside the Hadoop cluster?

I have some Spark applications that store their output to HDFS.

Since our Hadoop cluster uses NameNode H/A, and the Spark cluster sits outside the Hadoop cluster (I know that's not ideal), I need to give the application an HDFS URI so that it can access HDFS.

But it doesn't recognize the nameservice, so I can only give it one NameNode's URI, and if that NameNode fails I have to modify the configuration file and try again.

Querying ZooKeeper to discover the active NameNode seems very annoying, so I'd like to avoid that.

Could you suggest any alternatives?

Naphthyl answered 12/6, 2015 at 6:52 Comment(5)
You can use the active NameNode's URI to connect. It should look like this: hdfs://hostname:8020 – Geer
Sorry, but I already did that. I want to know how to find the active NameNode without checking manually. – Naphthyl
http://<namenode_hostname>:50070/dfshealth.jsp shows which NameNode is in the active state. – Geer
You can also use the command hadoop dfsadmin -report to get the status. – Geer
Thanks for the additional information, but I'm trying to avoid "manual" discovery. The Spark application should find the active NameNode automatically. – Naphthyl

Suppose your nameservice is 'hadooptest'; then set the Hadoop configuration as below. You can get this information from the hdfs-site.xml file of the remote HA-enabled HDFS cluster.

sc.hadoopConfiguration.set("dfs.nameservices", "hadooptest")
sc.hadoopConfiguration.set("dfs.client.failover.proxy.provider.hadooptest", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
sc.hadoopConfiguration.set("dfs.ha.namenodes.hadooptest", "nn1,nn2")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hadooptest.nn1", "10.10.14.81:8020")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hadooptest.nn2", "10.10.14.82:8020")

After this, you can use URLs with the 'hadooptest' nameservice, like below.

test.write.orc("hdfs://hadooptest/tmp/test/r1")
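
If you build the SparkSession yourself, the same keys can also be passed with the spark.hadoop. prefix, which Spark copies into the Hadoop configuration. A minimal sketch, assuming Spark 2.x or later and the same placeholder addresses as above:

import org.apache.spark.sql.SparkSession

// Every "spark.hadoop.*" entry is forwarded into the Hadoop Configuration.
val spark = SparkSession.builder()
  .appName("hdfs-ha-example")  // hypothetical application name
  .config("spark.hadoop.dfs.nameservices", "hadooptest")
  .config("spark.hadoop.dfs.ha.namenodes.hadooptest", "nn1,nn2")
  .config("spark.hadoop.dfs.namenode.rpc-address.hadooptest.nn1", "10.10.14.81:8020")
  .config("spark.hadoop.dfs.namenode.rpc-address.hadooptest.nn2", "10.10.14.82:8020")
  .config("spark.hadoop.dfs.client.failover.proxy.provider.hadooptest",
          "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
  .getOrCreate()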

Check here for more information.

Garneau answered 12/9, 2016 at 7:34 Comment(0)

If you want to make an H/A HDFS cluster your default config (usually the case), so that it applies to every application started through spark-submit or spark-shell, you can write the cluster information into spark-defaults.conf.

sudo vim $SPARK_HOME/conf/spark-defaults.conf

Then add the following lines, assuming your HDFS nameservice is hdfs-k8s:

spark.hadoop.dfs.nameservices   hdfs-k8s
spark.hadoop.dfs.ha.namenodes.hdfs-k8s  nn0,nn1
spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn0 192.168.23.55:8020
spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn1 192.168.23.56:8020
spark.hadoop.dfs.client.failover.proxy.provider.hdfs-k8s    org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

It should work when the next application is launched, for example:

sc.addPyFile('hdfs://hdfs-k8s/user/root/env.zip')
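
As a quick sanity check from spark-shell (a sketch: it assumes the spark-defaults.conf entries above are in place, spark is the shell's SparkSession, and /user/root is a hypothetical path), the nameservice should resolve without naming either NameNode:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// The logical nameservice "hdfs-k8s" is resolved by the client-side
// failover proxy, so no NameNode host appears in application code.
val fs = FileSystem.get(new URI("hdfs://hdfs-k8s"), spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/user/root")).foreach(status => println(status.getPath))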
Triiodomethane answered 21/10, 2019 at 3:9 Comment(0)

For Kerberos-enabled clusters, you can access HDFS using the following properties. More information here. You can get this information from the remote HA cluster's hdfs-site.xml file.

spark.sparkContext.hadoopConfiguration.set("dfs.nameservices", "testnameservice")
spark.sparkContext.hadoopConfiguration.set("dfs.client.failover.proxy.provider.testnameservice", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
spark.sparkContext.hadoopConfiguration.set("dfs.ha.namenodes.testnameservice", "nn1,nn2")
spark.sparkContext.hadoopConfiguration.set("dfs.namenode.rpc-address.testnameservice.nn1", "namenode1_hostname:8020")
spark.sparkContext.hadoopConfiguration.set("dfs.namenode.rpc-address.testnameservice.nn2", "namenode2_hostname:8020")
spark.read.csv("hdfs://testnameservice/path/to/hdfs/sample.csv")

You may also have configured Spark to obtain Kerberos tokens at launch via the property spark.kerberos.access.hadoopFileSystems for Spark 3.0+ or spark.kerberos.access.namenodes for Spark < 3.0, as mentioned here. Unfortunately, that property accepts only the active NameNode, so you have to poll the NameNode service at http://namenode_service:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus and retrieve the active NameNode yourself.

Loxodrome answered 31/1, 2022 at 17:46 Comment(0)
  1. Copy the Hadoop configuration dir to your Spark cluster
  2. Point Spark to this dir by setting HADOOP_CONF_DIR in spark-env.sh

e.g.

echo "HADOOP_CONF_DIR=\"/opt/hadoop/etc/hadoop\"" > spark-env.sh
Emasculate answered 12/9, 2016 at 8:1 Comment(1)
It should be >> (append) so it doesn't overwrite the whole file – Indue

I came across a similar issue. In my case, I had the list of hosts in an HA-enabled environment, but no information about which node was "Active".

To solve the problem, I made an HTTP call to each NameNode's JMX endpoint to get its status; this is the call I used in my code:

curl 'http://[hdfsHost]:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus'

I make the above call against each NameNode host. It returns the state of each node as JSON output, like this:

{
  "..." : [ {
    "name" : "Hadoop:service=NameNode,name=NameNodeStatus",
    "modelerType" : "org.apache.hadoop.hdfs.server.namenode.NameNode",
    "State" : "active",
    .......
  } ]
}

If a node is in standby, you will get "State" : "standby".

Once you have the JSON, you can parse it and read the State value.
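
A minimal Scala sketch of that loop, assuming two hypothetical NameNode hosts and the default HTTP port 50070 (9870 on Hadoop 3.x); it uses a crude string match instead of a JSON library:

import scala.io.Source
import scala.util.Try

// Query each NameNode's JMX servlet and keep the first host whose
// NameNodeStatus bean reports "State" : "active".
val nameNodes = Seq("namenode1_hostname", "namenode2_hostname")  // hypothetical hosts

def isActive(host: String): Boolean = Try {
  val url = s"http://$host:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"
  val json = Source.fromURL(url).mkString
  json.contains("\"State\" : \"active\"") || json.contains("\"State\":\"active\"")
}.getOrElse(false)

val activeNameNode = nameNodes.find(isActive)
  .getOrElse(sys.error("No active NameNode found"))
println(s"Active NameNode: $activeNameNode")

Once you know the active host you can use it directly, but where possible it is still simpler to give both hosts to the HA client configuration shown in the other answers and let the ConfiguredFailoverProxyProvider handle this for you.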

Hardandfast answered 24/3, 2017 at 12:36 Comment(0)
