namespace image and edit log

In the book "Hadoop: The Definitive Guide", under the topic "Namenodes and Datanodes", it is mentioned that:

The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.

secondary namenode, which despite its name does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.

I am confused about these two files, the namespace image and the edit log.

As I understand it, the namespace image is for storing the metadata.

So, my questions are:

What is the edit log? And what is its role?
Can you explain the statement "Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large."?

Can anyone please explain what the edit log is? What is the role of this log file?

When the NameNode starts up for the first time, the fsimage file is empty. Whenever the NameNode receives a create/update/delete request, the request is first recorded in the edits file for durability. Once it has been persisted there, the in-memory metadata is updated as well, because all read requests are served from the in-memory snapshot of the metadata.
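To make the ordering concrete, here is a minimal toy sketch in Python. It is not Hadoop code: the ToyNameNode class, the JSON record layout and the file name are all made up for illustration. It only shows the idea that a change is appended to the edits file first and applied to the in-memory map second, while reads use the map alone.

```python
import json
import os
import tempfile


class ToyNameNode:
    """Hypothetical model: an append-only edits file plus an in-memory namespace."""

    def __init__(self, edits_path):
        self.edits_path = edits_path
        self.namespace = {}  # path -> metadata; all reads are served from here

    def mutate(self, op, path, meta=None):
        record = {"op": op, "path": path, "meta": meta}
        # 1. Persist the request first, so it survives a crash.
        with open(self.edits_path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2. Only then update the in-memory view.
        if op == "delete":
            self.namespace.pop(path, None)
        else:  # "create" or "update"
            self.namespace[path] = meta

    def lookup(self, path):
        # Reads never touch the disk.
        return self.namespace.get(path)


workdir = tempfile.mkdtemp()
nn = ToyNameNode(os.path.join(workdir, "edits"))
nn.mutate("create", "/user/a.txt", {"replication": 3})
nn.mutate("update", "/user/a.txt", {"replication": 2})
nn.mutate("create", "/tmp/b.txt", {"replication": 3})
nn.mutate("delete", "/tmp/b.txt")
print(nn.lookup("/user/a.txt"))  # {'replication': 2}
```

Note that the edits file now holds four records even though only one file is left in the namespace. That gap between log length and live state is what the rest of this answer is about.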

Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.

As you can see, at this point the edits file keeps growing without bound. If the NameNode is restarted, or goes down for some reason and is brought back up, it has no in-memory representation of the metadata, so it has to read the edits file and rebuild the snapshot in memory, which can take a while depending on the size of the edits file.
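A rough way to picture the restart cost, again as a toy Python sketch with a made-up record format (the replay function is hypothetical, not a Hadoop API): the work done at startup depends on how many records are in the log, not on how much metadata is left at the end.

```python
import json


def replay(edit_lines, namespace=None):
    """Rebuild the namespace by applying every logged edit in order."""
    namespace = dict(namespace or {})
    applied = 0
    for line in edit_lines:
        record = json.loads(line)
        if record["op"] == "delete":
            namespace.pop(record["path"], None)
        else:  # "create" or "update"
            namespace[record["path"]] = record["meta"]
        applied += 1
    return namespace, applied


# A log in which one file is updated many times: the final state is tiny,
# but a restart still has to process every single record.
edit_log = [
    json.dumps({"op": "update", "path": "/data/f", "meta": {"version": i}})
    for i in range(10_000)
]
state, applied = replay(edit_log)
print(len(state), applied)  # 1 10000
```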

Because the edits file is a WAL (write-ahead log), all events have to be written one after another (append only); nothing in the file is updated in place, which avoids random disk seeks.

To prevent this overhead (or to keep the edits file manageable), the SecondaryNameNode (SNN) was introduced. The sole purpose of the SNN is to make sure the edits file does not grow without bound. By default, the SNN triggers a process called checkpointing whenever the edits file reaches 64 MB or every hour, whichever comes first.
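The "whichever comes first" rule is just a logical OR of two thresholds. A small Python sketch, using the default values quoted above (the constant and function names are invented for this example and are not Hadoop property names):

```python
CHECKPOINT_SIZE_BYTES = 64 * 1024 * 1024  # 64 MB, default quoted in this answer
CHECKPOINT_PERIOD_SECS = 60 * 60          # one hour, default quoted in this answer


def should_checkpoint(edits_size_bytes, secs_since_last_checkpoint):
    """Return True as soon as either threshold is reached."""
    return (edits_size_bytes >= CHECKPOINT_SIZE_BYTES
            or secs_since_last_checkpoint >= CHECKPOINT_PERIOD_SECS)


print(should_checkpoint(70 * 1024 * 1024, 120))  # True: size limit reached
print(should_checkpoint(1024, 3600))             # True: an hour has passed
print(should_checkpoint(1024, 120))              # False: neither limit reached
```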

The checkpointing process itself is simple. The SNN tells the NN to roll its current edits log and create a new edits file called edits.new. The SNN then copies the fsimage and edits files from the NN and applies the events in the edits file to the existing fsimage (the copy brought over from the NN). Once that is done, the new fsimage file is sent back to the NN, which replaces its existing fsimage with the new one and renames edits.new to edits. The NN now has a current fsimage with the events from the edits file applied.
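The same steps can be sketched as a toy in Python. This is only an illustration under simplifying assumptions: both roles share one local directory instead of copying files over the network, the fsimage is a JSON dictionary, each edit is one JSON line, and the helper names (apply_edits, checkpoint) are invented. The real on-disk formats are different.

```python
import json
import os
import tempfile


def apply_edits(image, edits_path):
    """Apply each logged edit, in order, to a copy of the image."""
    image = dict(image)
    with open(edits_path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["op"] == "delete":
                image.pop(rec["path"], None)
            else:  # "create" or "update"
                image[rec["path"]] = rec["meta"]
    return image


def checkpoint(nn_dir):
    fsimage = os.path.join(nn_dir, "fsimage")
    edits = os.path.join(nn_dir, "edits")
    edits_new = os.path.join(nn_dir, "edits.new")

    # 1. The NN rolls its log: new writes would now go to edits.new.
    open(edits_new, "a").close()

    # 2. The SNN takes fsimage and edits and merges them.
    with open(fsimage) as f:
        image = json.load(f)
    merged = apply_edits(image, edits)

    # 3. The merged image goes back to the NN and replaces the old fsimage.
    tmp = fsimage + ".ckpt"
    with open(tmp, "w") as f:
        json.dump(merged, f)
    os.replace(tmp, fsimage)

    # 4. edits.new is renamed to edits, dropping the already-merged records.
    os.replace(edits_new, edits)


nn_dir = tempfile.mkdtemp()
with open(os.path.join(nn_dir, "fsimage"), "w") as f:
    json.dump({}, f)  # empty image, as on a first start
with open(os.path.join(nn_dir, "edits"), "w") as f:
    for rec in (
        {"op": "create", "path": "/a", "meta": {"replication": 3}},
        {"op": "create", "path": "/b", "meta": {"replication": 3}},
        {"op": "delete", "path": "/a", "meta": None},
    ):
        f.write(json.dumps(rec) + "\n")

checkpoint(nn_dir)

with open(os.path.join(nn_dir, "fsimage")) as f:
    image_after = json.load(f)
edits_size_after = os.path.getsize(os.path.join(nn_dir, "edits"))
print(image_after, edits_size_after)  # {'/b': {'replication': 3}} 0
```

After the checkpoint the fsimage already contains the effect of all three edits and the edits file is empty again, so a restart at this point would have almost nothing to replay.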

So if the NameNode is restarted after a checkpoint has completed, it only has to load the fsimage into memory and apply the recent updates from the edits log (those written after the checkpoint completed) to get an up-to-date view of the namespace, which is more efficient.
