HDFS error: could only be replicated to 0 nodes, instead of 1

14

70

I've created an Ubuntu single-node Hadoop cluster in EC2.

Testing a simple file upload to HDFS works from the EC2 machine, but doesn't work from a machine outside of EC2.

I can browse the filesystem through the web interface from the remote machine, and it shows one datanode which is reported as in service. I have opened all TCP ports from 0 to 60000(!) in the security group, so I don't think it's that.

I get the error:

java.io.IOException: File /user/ubuntu/pies could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1448)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:690)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:342)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1350)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1346)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1344)

at org.apache.hadoop.ipc.Client.call(Client.java:905)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:928)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:811)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

The namenode log just shows the same error. The other logs don't seem to have anything interesting.

Any ideas?

Cheers

Bendigo answered 14/3, 2011 at 0:11 Comment(2)
I had a problem in setting up a single node VM. I removed configuration properties from conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml. It works fine on my VM. Disclaimer: I am an absolute beginner. I think these changes lead to the default values for a single instance, and that made it work. HTH.Electrothermal
I also had the same problem/error. The problem occurred in the first place when I formatted using hadoop namenode -format. After restarting hadoop with start-all.sh, the data node did not start or initialize. You can check this using jps; there should be five entries. If the datanode is missing, then you can do this: #11889761Witchcraft
76

WARNING: The following will destroy ALL data on HDFS. Do not execute the steps in this answer unless you do not care about destroying existing data!!

You should do this:

  1. Stop all Hadoop services
  2. Delete the dfs/name and dfs/data directories
  3. Run hdfs namenode -format and answer with a capital Y when prompted
  4. Start the Hadoop services

Also, check the disk space on your system and make sure the logs are not warning you about it.
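For reference, here is a minimal shell sketch of those steps. The directory paths are assumptions based on the default layout, where dfs.name.dir and dfs.data.dir live under hadoop.tmp.dir (/tmp/hadoop-&lt;user&gt; by default); substitute the paths from your own configuration.

# WARNING: this destroys all HDFS data, as noted above
stop-all.sh                          # or stop-dfs.sh and stop-yarn.sh on newer versions
rm -rf /tmp/hadoop-$USER/dfs/name    # assumed default dfs.name.dir location
rm -rf /tmp/hadoop-$USER/dfs/data    # assumed default dfs.data.dir location
hdfs namenode -format                # answer the prompt with a capital Y
start-all.sh                         # or start-dfs.sh and start-yarn.sh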

Colettacolette answered 6/1, 2012 at 20:48 Comment(5)
Now that I see this, I remember something similar saving me before. And it saved me again today, thanks. I had been assuming 'namenode -format' blanked everything, but there was some messed-up state surviving.Blackguardly
how is deleting all the files a solution?? how strange!!Noisy
Can somebody comment on the problem underlying this? I have only ephemeral data stored in HDFS, so this works. I'd prefer to change whatever configuration needs to be changed so that I can prevent this from happening again.Nudge
@Colettacolette where do I find the dfs/name and dfs/data directories in hadoop-2.9.0? I tried the find command but it didn't work.Readymade
@Noisy yea, reformatting should never be the answer. But judging from the number of upvotes, apparently this problem pops up mostly for people setting up their test clusters, who probably did not format it correctly the first time..Leanneleanor
14

This is your issue: the client can't communicate with the Datanode, because the IP the client received for the Datanode is an internal IP and not the public IP. Take a look at this:

http://www.hadoopinrealworld.com/could-only-be-replicated-to-0-nodes/

Look at the source code of DFSClient$DFSOutputStream (Hadoop 1.2.1):

//
// Connect to first DataNode in the list.
//
success = createBlockOutputStream(nodes, clientName, false);

if (!success) {
  LOG.info("Abandoning " + block);
  namenode.abandonBlock(block, src, clientName);

  if (errorIndex < nodes.length) {
    LOG.info("Excluding datanode " + nodes[errorIndex]);
    excludedNodes.add(nodes[errorIndex]);
  }

  // Connection failed. Let's wait a little bit and retry
  retry = true;
}

The key thing to understand here is that the Namenode only provides the list of Datanodes on which to store the blocks. The Namenode does not write the data to the Datanodes; it is the job of the client to write the data to the Datanodes using the DFSOutputStream. Before any write can begin, the above code makes sure that the client can communicate with the Datanode(s), and if communication with a Datanode fails, that Datanode is added to excludedNodes.
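If that is indeed the cause (typically a client outside EC2 that only sees the datanodes' private IPs), one commonly suggested workaround, assuming a Hadoop version that supports the property, is to have the client resolve datanodes by hostname rather than by the IP the namenode reports, and then map those hostnames to public IPs in the client machine's hosts file (as a comment below describes). A minimal sketch of the client-side hdfs-site.xml:

<!-- client-side hdfs-site.xml: connect to datanodes by hostname, which you
     can then map to public IPs in the client machine's hosts file -->
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>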

Dynamics answered 17/2, 2014 at 16:47 Comment(2)
If it's indeed the issue, what can I do to get the public IP address when connecting to the AWS cluster? ThanksCristalcristate
I was running Talend from a Windows machine. I made an entry in the Windows hosts file: <<public ip address of EC2>> <<internal or private hostname>>.Siloum
9

Look at the following.

This exception (could only be replicated to 0 nodes, instead of 1) means that no Datanode is available to the Namenode.

These are the cases in which a Datanode may not be available to the Namenode:

  1. The Datanode's disk is full.

  2. The Datanode is busy with block reporting and block scanning.

  3. The block size is a negative value (dfs.block.size in hdfs-site.xml).

  4. The primary Datanode goes down while a write is in progress (any network fluctuations between the Namenode and Datanode machines).

  5. Whenever we append a partial chunk and call sync, for subsequent partial chunk appends the client should keep the previous data in its buffer.

For example, after appending "a" I call sync, and when I then try to append "b" the buffer should contain "ab".

On the server side, when the chunk is not a multiple of 512 bytes, it will try to compare the CRC of the data present in the block file with the CRC present in the meta file. But while constructing the CRC for the data present in the block, it always compares only up to the initial offset. For more analysis, please check the Datanode logs.

Reference: http://www.mail-archive.com/[email protected]/msg01374.html

Chandos answered 11/11, 2011 at 15:53 Comment(2)
also happens if datanode can not reach namenode on its listening port (eg: 9000). See https://mcmap.net/q/281320/-file-jobtracker-info-could-only-be-replicated-to-0-nodes-instead-of-1Maxim
A port issue was what caused the OP's error for me. I did not have the dfs.datanode.address port address open (which is 50010 by default for CDH).Pussy
8

I had a similar problem setting up a single-node cluster. I realized that I hadn't configured any datanode. I added my hostname to conf/slaves, and then it worked. Hope it helps.
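A minimal sketch of that change, assuming the Hadoop 1.x layout where the slaves file lives under conf/ (newer releases use etc/hadoop/slaves or etc/hadoop/workers instead):

echo "$(hostname)" >> conf/slaves    # register this machine as a datanode
# avoid stray blank lines in conf/slaves (see the comment below)
stop-dfs.sh && start-dfs.sh          # restart HDFS so the datanode starts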

Bison answered 8/10, 2011 at 10:23 Comment(1)
I had an empty line in slaves/master file at the end and it was failing because of that :/Urology
4

I'll try to describe my setup and solution. My setup: RHEL 7, hadoop-2.7.3.

I tried to set up Standalone Operation first and then Pseudo-Distributed Operation; the latter failed with the same issue.

However, when I started hadoop with:

sbin/start-dfs.sh

I got the following:

Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/<user>/hadoop-2.7.3/logs/hadoop-<user>-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/<user>/hadoop-2.7.3/logs/hadoop-<user>-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/<user>/hadoop-2.7.3/logs/hadoop-<user>-secondarynamenode-localhost.localdomain.out

which looks promising (starting datanode... with no failures), but the datanode did not actually exist.

Another indication was that no datanode showed up as in operation on the HDFS web UI (the screenshot originally attached here showed the fixed, working state).

I fixed that issue by doing:

rm -rf /tmp/hadoop-<user>/dfs/name
rm -rf /tmp/hadoop-<user>/dfs/data

and then start again:

sbin/start-dfs.sh
...
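A plausible underlying cause, and this is an assumption based on the default configuration rather than something verified here, is that these directories live under /tmp, where a re-format or an OS cleanup can leave the namenode and datanode with mismatched storage IDs. Pointing them at a persistent location in hdfs-site.xml avoids that; the paths below are placeholders:

<!-- hdfs-site.xml: keep HDFS metadata and data out of /tmp (example paths) -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///var/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///var/hadoop/dfs/data</value>
</property>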
Consolatory answered 2/9, 2016 at 10:12 Comment(1)
I do not have any /tmp/hadoop/* files, but what you described matches my problem.Epicenter
3

I had the same error on Mac OS X 10.7 (hadoop-0.20.2-cdh3u0) due to the data node not starting.
start-all.sh produced the following output:

starting namenode, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
localhost: ssh: connect to host localhost port 22: Connection refused
localhost: ssh: connect to host localhost port 22: Connection refused
starting jobtracker, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
localhost: ssh: connect to host localhost port 22: Connection refused

After enabling ssh login via System Preferences -> Sharing -> Remote Login, it started to work.
The start-all.sh output changed to the following (note that the datanode now starts):

starting namenode, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
Password:
localhost: starting datanode, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
Password:
localhost: starting secondarynamenode, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
starting jobtracker, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
Password:
localhost: starting tasktracker, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
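If you would rather not type a password for each daemon (the Password: prompts above), passwordless ssh to localhost is the usual setup; a sketch assuming OpenSSH defaults:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa          # skip if you already have a key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost true                                # should now succeed without a password prompt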
Diagraph answered 19/7, 2012 at 11:54 Comment(0)
2

I think you should make sure all the datanodes are up when you copy to DFS. In some cases this takes a while. I think that's why the "check the health status" solution works: you go to the health status web page and wait for everything to come up. My five cents.

Gaeta answered 29/10, 2011 at 16:34 Comment(0)
2

It took me a week to figure out the problem in my situation.

When the client (your program) asks the namenode for a data operation, the namenode picks a datanode and directs the client to it by giving the client that datanode's IP.

But when the datanode host is configured with multiple IPs and the namenode gives you one that your client CAN'T ACCESS, the client adds that datanode to its exclude list and asks the namenode for a new one. Eventually all datanodes are excluded and you get this error.

So check the nodes' IP settings before you try everything else!
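A quick way to spot this (the exact commands are only a suggestion) is to check what each node's hostname resolves to and which addresses the datanodes registered with:

hostname -f             # the name the daemon registers with
hostname -i             # should not resolve to 127.0.0.1 / 127.0.1.1 on a multi-node setup
cat /etc/hosts          # the hostname must map to an address the client can reach
hdfs dfsadmin -report   # lists the address each live datanode registered with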

Enchiridion answered 12/5, 2017 at 6:52 Comment(0)
1

If all data nodes are running, one more thing to check is whether HDFS has enough space for your data. I could upload a small file but failed to upload a big file (30GB) to HDFS. 'bin/hdfs dfsadmin -report' showed that each data node only had a few GB available.
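For reference, two commands for checking capacity; the interpretation in the comments is mine:

hdfs dfsadmin -report   # per-datanode capacity and remaining space
hdfs dfs -df -h /       # overall filesystem size, used and available space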

Cherie answered 16/6, 2014 at 23:24 Comment(0)
0

Have you tried the recommendations from the wiki http://wiki.apache.org/hadoop/HowToSetupYourDevelopmentEnvironment ?

I was getting this error when putting data into the dfs. The solution is strange and probably inconsistent: I erased all temporary data along with the namenode, reformatted the namenode, started everything up, and visited my "cluster's" dfs health page (http://your_host:50070/dfshealth.jsp). The last step, visiting the health page, is the only way I can get around the error. Once I've visited the page, putting and getting files in and out of the dfs works great!
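One possible explanation, offered here as an assumption rather than something the answer states, is that the namenode was still in safe mode while the datanodes re-registered, and visiting the health page simply gave it enough time. You can wait for that explicitly instead (on older releases the command is hadoop dfsadmin rather than hdfs dfsadmin):

hdfs dfsadmin -safemode get    # reports whether the namenode is still in safe mode
hdfs dfsadmin -safemode wait   # blocks until the namenode leaves safe mode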

Nolte answered 14/3, 2011 at 14:41 Comment(1)
I'm having the same problem described in the question, found and used this method, but had no success.Showmanship
0

Reformatting the node is not the solution. You will have to edit start-all.sh: start DFS, wait for it to start completely, and then start MapReduce. You can do this using a sleep; waiting for 1 second worked for me. See the complete solution here: http://sonalgoyal.blogspot.com/2009/06/hadoop-on-ubuntu.html.
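A minimal sketch of that idea, assuming the Hadoop 1.x start-dfs.sh/start-mapred.sh scripts rather than patching start-all.sh itself:

start-dfs.sh       # start HDFS first and let the datanodes register
sleep 1            # the delay that worked for the answerer; adjust as needed
start-mapred.sh    # then start MapReduce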

Kobarid answered 29/7, 2011 at 12:14 Comment(0)
0

I realize I'm a little late to the party, but I wanted to post this for future visitors of this page. I was having a very similar problem when I was copying files from local to hdfs and reformatting the namenode did not fix the problem for me. It turned out that my namenode logs had the following error message:

2012-07-11 03:55:43,479 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-920118459-192.168.3.229-50010-1341506209533, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: Too many open files
        at java.io.UnixFileSystem.createFileExclusively(Native Method)
        at java.io.File.createNewFile(File.java:883)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.createTmpFile(FSDataset.java:491)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.createTmpFile(FSDataset.java:462)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.createTmpFile(FSDataset.java:1628)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1514)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:113)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:381)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:171)

Apparently, this is a relatively common problem on hadoop clusters and Cloudera suggests increasing the nofile and epoll limits (if on kernel 2.6.27) to work around it. The tricky thing is that setting nofile and epoll limits is highly system dependent. My Ubuntu 10.04 server required a slightly different configuration for this to work properly, so you may need to alter your approach accordingly.
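As a rough illustration only, since the file locations, user name and values vary by system, raising the open-file limit for the user running the datanode typically looks something like the following; the epoll limit mentioned above is a separate, kernel-version-specific sysctl not shown here.

# /etc/security/limits.conf (example user and values -- adapt to your system)
hdfs  soft  nofile  16384
hdfs  hard  nofile  32768

# verify after logging in again as that user
ulimit -n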

Overhead answered 11/7, 2012 at 23:54 Comment(0)
0

Don't format the namenode immediately. Try stop-all.sh and then start everything again with start-all.sh. If the problem persists, go for formatting the namenode.

Cognation answered 19/4, 2017 at 7:7 Comment(0)
0

Follow the steps below:
1. Stop dfs and yarn.
2. Remove the datanode and namenode directories, as specified in core-site.xml.
3. Start dfs and yarn as follows:

start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
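If you are unsure where those directories are (as asked in a comment above), you can query the running configuration; hdfs getconf is available on Hadoop 2.x and later, and the property names below assume the 2.x naming:

hdfs getconf -confKey dfs.namenode.name.dir
hdfs getconf -confKey dfs.datanode.data.dir
hdfs getconf -confKey hadoop.tmp.dir   # the default parent if the above are unset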
Cadet answered 4/5, 2017 at 12:27 Comment(0)
