Cannot Read a file from HDFS using Spark

I have installed Cloudera CDH 5 using Cloudera Manager.

I can easily run:

hadoop fs -ls /input/war-and-peace.txt
hadoop fs -cat /input/war-and-peace.txt

The second command prints the whole text file to the console.

Now I start the Spark shell and run:

val textFile = sc.textFile("hdfs://input/war-and-peace.txt")
textFile.count

Now I get this error:

Spark context available as sc.

scala> val textFile = sc.textFile("hdfs://input/war-and-peace.txt")
2014-12-14 15:14:57,874 INFO  [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(177621) called with curMem=0, maxMem=278302556
2014-12-14 15:14:57,877 INFO  [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_0 stored as values in memory (estimated size 173.5 KB, free 265.2 MB)
textFile: org.apache.spark.rdd.RDD[String] = hdfs://input/war-and-peace.txt MappedRDD[1] at textFile at <console>:12

scala> textFile.count
2014-12-14 15:15:21,791 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 0 time(s); maxRetries=45
2014-12-14 15:15:41,905 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 1 time(s); maxRetries=45
... (the same retry line repeats every 20 seconds, up to "Already tried 27 time(s)") ...
2014-12-14 15:24:23,250 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 27 time(s); maxRetries=45
java.net.ConnectException: Call From dn1home/192.168.1.21 to input:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
        at org.apache.hadoop.ipc.Client.call(Client.java:1415)

Why do I get this error, when I am able to read the same file using Hadoop commands?

Provincetown answered 15/12, 2014 at 5:47 Comment(0)

Here is the solution:

sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt")

How did I find out nn1home:8020?

Search for the file core-site.xml and look for the XML element fs.defaultFS.
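
If you already have a spark-shell open, you can also print the value Spark picked up from the Hadoop configuration. A minimal check (assuming sc is the shell's SparkContext):

println(sc.hadoopConfiguration.get("fs.defaultFS"))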

Provincetown answered 15/12, 2014 at 5:54 Comment(5)
the core-site.xml is always located in the conf directory, either on a local or cluster installation of Spark. – Nickolenicks
for me the config file was at $HADOOP_HOME/etc/hadoop/core-site.xml – Vanvanadate
Without nn1home:8020, it should be sc.textFile("hdfs:///input/war-and-peace.txt") – Welbie
This is when you are running code on the Hadoop cluster, not remotely. Right? – Elsaelsbeth
I'm running on GCP Dataproc, and there is no $HADOOP_HOME environment variable set. How can I find the core-site.xml? – Province

If you want to use sc.textFile("hdfs://..."), you need to give the full (absolute) path; in your example that would be "hdfs://nn1home:8020/..".

If you want to keep it simple, just use sc.textFile("hdfs:/input/war-and-peace.txt")

Note that's only one / after hdfs:.
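
To illustrate (nn1home:8020 is just this thread's example NameNode; substitute your own fs.defaultFS value), both forms should point at the same file:

sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt")   // fully qualified URI
sc.textFile("hdfs:/input/war-and-peace.txt")                 // authority taken from fs.defaultFS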

Chorion answered 19/6, 2016 at 12:39 Comment(0)

This will work:

val textFile = sc.textFile("hdfs://localhost:9000/user/input.txt")

Here, you can take localhost:9000 from the fs.defaultFS parameter value in Hadoop's core-site.xml config file.

Sociology answered 28/9, 2017 at 11:52 Comment(0)

You are not passing a proper URL string.

  • hdfs:// - the protocol/scheme
  • localhost - hostname or IP address of the NameNode (may be different for you, e.g. 127.56.78.4)
  • 54310 - the NameNode port number
  • /input/war-and-peace.txt - the complete path to the file you want to load

Finally, the URL should look like this:

hdfs://localhost:54310/input/war-and-peace.txt
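
Putting it together in the shell (hostname and port are this answer's examples; use your own NameNode address):

val textFile = sc.textFile("hdfs://localhost:54310/input/war-and-peace.txt")
textFile.count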
Minify answered 6/6, 2017 at 13:39 Comment(0)

If you started Spark with HADOOP_HOME set in spark-env.sh, Spark knows where to look for the HDFS configuration files.

In this case Spark already knows the location of your NameNode/DataNode, and the following alone should work fine to access HDFS files:

sc.textFile("/myhdfsdirectory/myfiletoprocess.txt")

You can create myhdfsdirectory as below:

hdfs dfs -mkdir /myhdfsdirectory

and from your local file system you can move myfiletoprocess.txt into that HDFS directory with the command below:

hdfs dfs -copyFromLocal mylocalfile /myhdfsdirectory/myfiletoprocess.txt
Grosbeak answered 21/2, 2017 at 9:17 Comment(0)

I'm also using CDH 5. For me the full path, i.e. "hdfs://nn1home:8020", is not working for some strange reason, even though most examples show the path like that.

I used a command like:

val textFile = sc.textFile("hdfs:/input1/Card_History2016_3rdFloor.csv")

Output of the above command:

textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:22

textFile.count

res1: Long = 58973  

and this works fine for me.

Martinet answered 4/8, 2016 at 13:23 Comment(0)

This worked for me:

logFile = "hdfs://localhost:9000/sampledata/sample.txt"
Subcartilaginous answered 11/11, 2016 at 7:28 Comment(0)
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("HDFSFileReader")
// Hadoop settings passed through SparkConf need the "spark.hadoop." prefix
// to reach the Hadoop configuration; the fully qualified URI below works
// regardless.
conf.set("spark.hadoop.fs.defaultFS", "hdfs://hostname:9000")
val sc = new SparkContext(conf)
val data = sc.textFile("hdfs://hostname:9000/hdfspath/")
data.saveAsTextFile("C:\\dummy")

The above code reads all HDFS files from the directory and saves them locally in the C:\dummy folder.

Idioplasm answered 19/9, 2017 at 7:51 Comment(0)

It might be an issue with the file path or URL, or with the HDFS port.

Solution: first open the core-site.xml file at $HADOOP_HOME/etc/hadoop and check the value of the property fs.defaultFS. Say the value is hdfs://localhost:9000 and the file's location in HDFS is /home/usr/abc/fileName.txt. Then the file URL is hdfs://localhost:9000/home/usr/abc/fileName.txt, and the following command reads the file from HDFS (the second argument is the minimum number of partitions):

var result = scontext.textFile("hdfs://localhost:9000/home/usr/abc/fileName.txt", 2)
Fancher answered 13/12, 2017 at 6:14 Comment(0)

Get the fs.defaultFS URL from core-site.xml (/etc/hadoop/conf) and read the file as below. In my case, fs.defaultFS is hdfs://quickstart.cloudera:8020.

txtfile = sc.textFile('hdfs://quickstart.cloudera:8020/user/cloudera/rddoutput')
txtfile.collect()

Gerhardine answered 28/11, 2019 at 1:8 Comment(0)
