How to successfully complete a namenode restart with 5TB worth of edit files to process

I have a namenode that had to be brought down for an emergency. It has not had an FSImage taken in 9 months and has about 5 TB of edit files to process on its next restart. The secondary namenode has not been running (or performed any checkpoint operations) for about 9 months, hence the 9-month-old FSImage.

There are about 7.8 million inodes in the HDFS cluster. The machine has about 260GB of total memory.

We've tried a few different combinations of Java heap size, GC algorithm, and so on, but have not found one that lets the restart complete without eventually slowing to a crawl due to full GCs (FGCs).
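
For context, the namenode's heap and GC settings live in HADOOP_NAMENODE_OPTS in hadoop-env.sh; the line below is only an illustration of the kind of combination we tried, with example values rather than the exact ones used:

# hadoop-env.sh -- illustrative values only
# Large fixed heap plus G1GC, trying to keep full GCs at bay during edit replay
export HADOOP_NAMENODE_OPTS="-Xms200g -Xmx200g -XX:+UseG1GC -XX:MaxGCPauseMillis=400 -verbose:gc ${HADOOP_NAMENODE_OPTS}"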

I have 2 questions:

  1. Has anyone found a namenode configuration that allows this large of an edit file backlog to complete successfully?

  2. An alternate approach I've considered is restarting the namenode with only a manageable subset of the edit files present. Once the namenode comes up and creates a new FSImage, bring it down, copy the next subset of edit files over, and then restart it. Repeat until it's processed the entire set of edit files. Would this approach work? Is it safe to do, in terms of the overall stability of the system and the file system?
Julide answered 12/7, 2018 at 21:9 Comment(6)
Do you have a secondary Name Node in your cluster?Waken
We do have one configured, but it is not currently running. The last time it was brought up was actually about 9 months ago when the last FSImage was taken, it looks like.Julide
That's too bad. The job of the SNN is to keep the edits down to a manageable size.Waken
If your host for the SNN is not the same as the NN host and has the same memory, then you could try restarting it and letting it process the backlog. blog.cloudera.com/blog/2014/03/…. It should basically do the same as what you want in approach 2.Waken
Would the SNN attempt to process the entire backlog of edit files before attempting to create another FSImage? If so, it would probably run out of memory also, I would think. Or are you suggesting to allow the SNN to work on a subset of the edit files at a time, like what I was thinking for the NN?Julide
No, you can specify the number of edits via dfs.namenode.checkpoint.txns. The SNN will checkpoint every dfs.namenode.checkpoint.txns edits. See github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/…. You should bump up dfs.namenode.num.checkpoints.retained to something higher (defaults to 2) just in case, too.Waken

We were able to get through the 5TB backlog of edits files using a version of the approach I suggested in question (2) of the original post. Here is the process we went through:

Solution:

  1. Make sure that the namenode is "isolated" from the datanodes. This can be done by either shutting down the datanodes, or just removing them from the slaves list while the namenode is offline. This is done to keep the namenode from being able to communicate with the datanodes before the entire backlog of edits files is processed.
  2. Move the entire set of edits files to a location outside of what is configured in the dfs.namenode.name.dir property of the namenode's hdfs-site.xml file.
  3. Move (or copy if you would like to maintain a backup) the next subset of edits files to be processed to the dfs.namenode.name.dir location. If you are not familiar with the naming convention for the FSImage and edits files, take a look at the example below. It will hopefully clarify what is meant by next subset of edits files.
  4. Update file seen_txid to contain the value of the last transaction represented by the last edits file from the subset you copied over in step (3). So if the last edits file is edits_0000000000000000011-0000000000000000020, you would want to update the value of seen_txid to 20. This essentially fools the namenode into thinking this subset is the entire set of edits files.
  5. Start up the namenode. If you take a look at the Startup Progress tab of the HDFS Web UI, you will see that the namenode will start with the latest present FSImage, process through the edits files present, create a new FSImage file, and then go into safemode while it waits for the datanodes to come online.
  6. Bring down the namenode.
  7. There will be an edits_inprogress_######## file created as a placeholder by the namenode. Unless this is the final set of edits files to process, delete this file.
  8. Repeat steps 3-7 until you've worked through the entire backlog of edits files.
  9. Bring up datanodes. The namenode should get out of safemode once it has been able to confirm the locations of a sufficient fraction of the data blocks.
  10. Set up a secondary namenode, or high availability for your cluster, so that the FSImage will periodically get created from now on.

Example:

Let's say we have FSImage fsimage_0000000000000000010 and a bunch of edits files:

edits_0000000000000000011-0000000000000000020
edits_0000000000000000021-0000000000000000030
edits_0000000000000000031-0000000000000000040
edits_0000000000000000041-0000000000000000050
edits_0000000000000000051-0000000000000000060
...
edits_0000000000000000091-0000000000000000100

Following the steps outlined above:

  1. All datanodes brought offline.
  2. All edits files copied from dfs.namenode.name.dir to another location, ex: /tmp/backup
  3. Let's process 2 files at a time. So copy edits_0000000000000000011-0000000000000000020 and edits_0000000000000000021-0000000000000000030 over to the dfs.namenode.name.dir location.
  4. Update seen_txid to contain a value of 30 since this is the last transaction we will be processing during this run.
  5. Start up the namenode, and confirm through the HDFS Web UI's Startup Progress tab that it correctly used fsimage_0000000000000000010 as a starting point and then processed edits_0000000000000000011-0000000000000000020 and edits_0000000000000000021-0000000000000000030. It then created a new FSImage file fsimage_0000000000000000030 and entered safemode, waiting for the datanodes to come up.
  6. Bring down the namenode.
  7. Delete the placeholder file edits_inprogress_######## since this is not the final set of edits files to be processed.
  8. Proceed with the next run and repeat until all edits files have been processed (a shell sketch of one such pass is shown below).
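
For reference, here is a rough shell sketch of a single pass (steps 3-7) using the file names from this example. The metadata path and daemon commands are assumptions (a dfs.namenode.name.dir of /data/nn and Hadoop 3.x daemon syntax), so adapt them to your environment:

# One pass of the manual replay procedure, using the example file names above
NAME_DIR=/data/nn/current        # assumed dfs.namenode.name.dir plus /current
BACKUP_DIR=/tmp/backup           # where the full edits backlog was parked

# Step 3: copy the next subset of edits files back into place
cp "$BACKUP_DIR"/edits_0000000000000000011-0000000000000000020 "$NAME_DIR"/
cp "$BACKUP_DIR"/edits_0000000000000000021-0000000000000000030 "$NAME_DIR"/

# Step 4: point seen_txid at the last transaction in this subset
echo 30 > "$NAME_DIR"/seen_txid

# Step 5: start the namenode so it replays the edits and writes a new FSImage
hdfs --daemon start namenode     # on Hadoop 2.x: hadoop-daemon.sh start namenode

# ...wait until fsimage_0000000000000000030 appears, then:

# Step 6: bring the namenode back down
hdfs --daemon stop namenode      # on Hadoop 2.x: hadoop-daemon.sh stop namenode

# Step 7: remove the placeholder edits_inprogress_* file unless this was the final subset
rm -f "$NAME_DIR"/edits_inprogress_*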
Julide answered 24/7, 2018 at 20:14 Comment(0)

If your Hadoop cluster is HA-enabled, then the standby NN should have taken care of this; in a non-HA setup, your secondary NN should have.

Check the logs of these namenode processes to see why they are not able to merge, or why they fail.

The parameters below control how often edits are checkpointed; with checkpointing working, this many edit files should never have accumulated.

dfs.namenode.checkpoint.period
dfs.namenode.checkpoint.txns
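
To see what a running cluster is actually using, you can query the live values (hdfs getconf is a standard subcommand; the defaults noted in the comments are the stock Hadoop defaults):

hdfs getconf -confKey dfs.namenode.checkpoint.period   # default 3600 (seconds)
hdfs getconf -confKey dfs.namenode.checkpoint.txns     # default 1000000 (transactions)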

Another way is to perform the merge manually, but this would only be a temporary fix:

hdfs dfsadmin -safemode enter
hdfs dfsadmin -rollEdits
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave

Running the commands above should merge the edits and save the namespace.
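
To confirm that the checkpoint landed, you can check for a fresh fsimage file under the namenode's metadata directory (the path below is an assumption; use whatever dfs.namenode.name.dir points to), or pull the latest image with dfsadmin:

ls -lt /data/nn/current/fsimage_* | head -n 2   # assumed dfs.namenode.name.dir of /data/nn
hdfs dfsadmin -fetchImage /tmp                  # downloads the most recent fsimage from the NN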

Fixate answered 13/7, 2018 at 11:12 Comment(3)
I edited the original posting to mention that the SNN has not been running (or had a checkpoint operation executed) in about 9 months. That is why the most recent FSImage is also that old.Julide
For the temporary operations that you mentioned, those would have to be run with the NN running, right?Julide
If you start the SNN, it will automatically merge the FSImage. Yes, the commands I mentioned need to be executed on the active NN.Fixate
