Hadoop safemode recovery - taking too long!

I have a Hadoop cluster with 18 data nodes. I restarted the name node over two hours ago and it is still in safe mode.

I have been searching for why this might be taking so long and I cannot find a good answer. The post here: Hadoop safemode recovery - taking lot of time is relevant, but I'm not sure whether I want/need to restart the name node after changing the setting that article mentions:

<property>
 <name>dfs.namenode.handler.count</name>
 <value>3</value>
 <final>true</final>
</property>
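
(For reference, safemode status and block-report progress can be checked without restarting the name node; a minimal sketch using the standard dfsadmin commands:)

hadoop dfsadmin -safemode get    # prints "Safe mode is ON" or "Safe mode is OFF"
hadoop dfsadmin -report          # shows which data nodes have checked in and how many blocks they report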

In any case, this is what I've been getting in 'hadoop-hadoop-namenode-hadoop-name-node.log':

2011-02-11 01:39:55,226 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020, call delete(/tmp/hadoop-hadoop/mapred/system, true) from 10.1.206.27:54864: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /tmp/hadoop-hadoop/mapred/system. Name node is in safe mode.
The reported blocks 319128 needs additional 7183 blocks to reach the threshold 0.9990 of total blocks 326638. Safe mode will be turned off automatically.
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /tmp/hadoop-hadoop/mapred/system. Name node is in safe mode.
The reported blocks 319128 needs additional 7183 blocks to reach the threshold 0.9990 of total blocks 326638. Safe mode will be turned off automatically.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:1711)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:1691)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:565)
    at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)

Any advice is appreciated. Thanks!

Settles answered 11/2, 2011 at 7:28 Comment(3)
What's your replication factor? – Samp
Replication factor is 3. And it's still in safe mode! – Settles
OK, yeah, you should definitely go for a higher handler count; it should be around 10. – Samp

I ran into this once, where some blocks were never reported in. I had to forcefully let the namenode leave safemode (hadoop dfsadmin -safemode leave) and then run an fsck to delete the missing files.
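
A minimal sketch of that sequence (hadoop CLI of the same era; note that -safemode leave accepts whatever blocks have been reported so far, and fsck -delete permanently removes the files whose blocks are missing):

hadoop dfsadmin -safemode get      # confirm the name node is still in safe mode
hadoop dfsadmin -safemode leave    # force it out of safe mode
hadoop fsck / | egrep -v '^\.+$' | grep -v replica   # list files with missing/corrupt blocks
hadoop fsck / -delete              # delete the corrupted files so the namespace is consistent again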

Wilona answered 16/2, 2011 at 19:4 Comment(4)
I ended up having to run '-safemode leave' as well after waiting several hours. There are still missing blocks, so I will also need to run fsck to delete the missing files. – Settles
Do you know a reason HDFS doesn't restore missing replicas itself? – Rookie
Then use hadoop fsck / -delete to clean up the data. – Gremlin
@xinit, @shane, @senile_genius, @Denis, I used hadoop dfsadmin -safemode leave, and now none of the block information for the files on the data nodes can be found. Even on the nn1 web interface the file names are still there, but the files cannot be downloaded since the block information can no longer be found. How can this be solved? – Disability

Check the property dfs.namenode.handler.count in hdfs-site.xml.

dfs.namenode.handler.count in hdfs-site.xml specifies the number of server threads the NameNode uses to process requests. Its default value is 10. Setting this property too low can cause the issue described here.
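
To confirm what value the running configuration actually resolves to (assuming the getconf subcommand is available, as in Hadoop 2.x and later):

hdfs getconf -confKey dfs.namenode.handler.count    # prints the effective value; 10 if unset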

Also check for missing or corrupt blocks: hdfs fsck / | egrep -v '^\.+$' | grep -v replica

hdfs fsck /path/to/corrupt/file -locations -blocks -files

If corrupt blocks are found, remove the affected files: hdfs dfs -rm /file-with-missing-or-corrupt-blocks
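
If many files are affected, fsck can do the cleanup itself; a minimal sketch (both options are destructive for the affected files: -move relocates them to /lost+found, -delete removes them outright):

hdfs fsck / -move      # move files with missing/corrupt blocks to /lost+found
hdfs fsck / -delete    # or delete them outright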

Delarosa answered 15/8, 2019 at 6:50 Comment(0)
