I'm running a Hadoop cluster with 24 servers. It has been running for some months, but since the last reboot the DataNodes keep dying with the following error:
2016-02-05 11:35:56,615 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40786, bytes: 118143861, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000330_0_-1595784897_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076219758_2486790, duration: 21719288540
2016-02-05 11:35:56,755 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40784, bytes: 118297616, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000231_0_-1089799971_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076221376_2488408, duration: 22149605332
2016-02-05 11:35:56,837 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40780, bytes: 118345914, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000208_0_-2005378882_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076231364_2498422, duration: 22460210591
2016-02-05 11:35:57,359 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40781, bytes: 118419792, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000184_0_406014429_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076221071_2488103, duration: 22978732747
2016-02-05 11:35:58,008 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40787, bytes: 118151696, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000324_0_-608122320_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076222362_2489394, duration: 23063230631
2016-02-05 11:36:00,295 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40776, bytes: 123206293, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000015_0_-846180274_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076244668_2511731, duration: 26044953281
2016-02-05 11:36:00,407 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40764, bytes: 123310419, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000010_0_-310980548_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076244751_2511814, duration: 26288883806
2016-02-05 11:36:01,371 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40783, bytes: 119653309, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000055_0_-558109635_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076222182_2489214, duration: 26808381782
2016-02-05 11:36:05,224 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM
2016-02-05 11:36:05,230 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at computer75/192.168.0.133
************************************************************/
Every time I restart the cluster it starts fine, with all the nodes up, but a few seconds into a MapReduce job some nodes die with that error. The dead nodes are different every time.
Do you have any idea what is happening? I'm using Hadoop 2.4.1 and, as I said, the cluster had been running for months without problems.
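In case it helps with comparing nodes, here is a minimal sketch of how the shutdown times could be pulled out of each DataNode log. It assumes the log lines have the same layout as the excerpt above; the function name `find_signals` and the sample list are only illustrative, not part of Hadoop.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2016-02-05 11:36:05,224 ERROR ...DataNode: RECEIVED SIGNAL 15: SIGTERM
SIGNAL_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) ERROR \S+: "
    r"RECEIVED SIGNAL (\d+): (\w+)"
)


def find_signals(lines):
    """Return (timestamp, signal number, signal name) for each signal line."""
    events = []
    for line in lines:
        match = SIGNAL_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            # The log prints milliseconds after the comma.
            stamp = stamp.replace(microsecond=int(match.group(2)) * 1000)
            events.append((stamp, int(match.group(3)), match.group(4)))
    return events


# Two lines copied from the log excerpt in this post.
sample = [
    "2016-02-05 11:36:05,224 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM",
    "2016-02-05 11:36:05,230 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:",
]

for stamp, number, name in find_signals(sample):
    print(stamp.isoformat(), number, name)
```

Running this over the log of each node that died would give the exact times the signals arrived, which could then be lined up against the job's start time or against system logs on the same host.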
Thanks.