How does MongoDB journaling work?

Here is my view, and I am not sure if it is right or wrong:

The journaling log is the "redo" log. It records the modifications to the data files.

For example, if I want to change the field value of one record from 'a' to 'b', MongoDB works out how to modify the dbfile (including the namespace, data, index, and so on) and then writes those modifications to the journal.

After that, MongoDB makes all the real modifications to the dbfile. If something goes wrong here, then when MongoDB restarts it will read the journal (if it exists) and alter the dbfile to make the data set consistent.

So, in the journal, the data to change is not recorded, but instead how to change the dbfile.

Am I right? Where can I get more information about the journal's format?

EDIT: my original link to a 2011 presentation at MongoSF by Dwight is now dead, but there is a 2012 presentation by Ben Becker with similar content.

Just in case that stops working at some point too, I will give a quick summary of how the journal in the original MMAP storage engine worked. Note that with the advent of the pluggable storage engine model (MongoDB 3.0 and later), journal behavior now depends entirely on the storage engine (and potentially the options) you are using - so please check the details for your engine.

Back to the original (MMAP) storage engine journal. At a very rudimentary level, the journal contains a series of queued operations, and all operations are written into it as they happen - basically an append-only sequential write to disk.

Once these operations have been applied and flushed to disk, then they are no longer needed in the journal and can be aged out. In this sense the journal basically acts like a circular buffer for write operations.
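To make the append-then-age-out idea concrete, here is a minimal in-memory sketch. The `ToyJournal` class, its method names, and the sequence numbers are all made up for illustration; this is not MongoDB's actual implementation or on-disk format.

```python
from collections import deque


class ToyJournal:
    """Toy in-memory stand-in for an append-only journal (hypothetical)."""

    def __init__(self):
        self.entries = deque()  # pending operations, oldest first
        self.next_seq = 0       # sequence number for the next appended operation

    def append(self, op):
        """Record an operation before it is applied to the data files."""
        seq = self.next_seq
        self.entries.append((seq, op))
        self.next_seq += 1
        return seq

    def age_out(self, flushed_through):
        """Drop entries whose changes are already flushed to the data files."""
        while self.entries and self.entries[0][0] <= flushed_through:
            self.entries.popleft()


journal = ToyJournal()
for op in ["op-a", "op-b", "op-c"]:
    journal.append(op)

# Assume the data files have now been flushed through sequence number 1.
journal.age_out(flushed_through=1)
pending = [op for _, op in journal.entries]
print(pending)  # ['op-c']
```

Only the operations that have not yet reached the data files stay in the journal, which is why it never needs to grow without bound.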

Internally, the operations in the journal are stored in "commit groups" - logical groups of write operations. Once an operation is in a complete commit group, it can be considered synced to disk as part of the journal (and will satisfy the j:true write concern, for example). After an unclean shutdown, mongod will attempt to apply all complete commit groups that have not previously been flushed to disk; incomplete commit groups will be discarded.
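The recovery rule for commit groups can be sketched roughly as follows. The record shapes here (`("op", payload)` tuples closed by a `("commit",)` marker) and the `recover` function are assumptions invented for this example, not the real journal layout.

```python
def recover(records):
    """Return the operations from complete commit groups, in order.

    `records` is a list of ("op", payload) tuples, with a ("commit",)
    marker closing each group. Both shapes are hypothetical.
    """
    applied = []  # operations from groups that were fully written
    current = []  # operations of the group currently being read
    for record in records:
        if record[0] == "commit":
            applied.extend(current)
            current = []
        else:
            current.append(record[1])
    # Anything left in `current` belongs to an incomplete group: discard it.
    return applied


journal = [
    ("op", "w1"), ("op", "w2"), ("commit",),
    ("op", "w3"), ("commit",),
    ("op", "w4"),  # crash happened before this group was completed
]
replayed = recover(journal)
print(replayed)  # ['w1', 'w2', 'w3']
```

The write `w4` is dropped because its group never completed, which is also why a write only counts as journaled once its whole group is on disk.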

The operations in the journal are not what you will see in the oplog. Rather, they are a simpler set of files, offsets (disk locations, essentially), and data to be written at those locations. This allows for efficient replay of the data and a compact journal format, but it makes the contents look like gibberish to most people (as opposed to the aforementioned oplog, which is basically readable as JSON documents). This answers one of the questions posed: the journal has no awareness of the database file's contents or of the logical changes being made to it. It is even simpler than that - it only knows to go to disk location X and write data Y, that's it.
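As a rough illustration of "go to disk location X and write data Y", the sketch below replays a list of (file name, offset, data) entries against a scratch file. The entry layout, the `replay` function, and the file name `test.0` are assumptions for the example; the real journal is a binary format with its own structure.

```python
import os
import tempfile


def replay(entries, directory):
    """Apply (file name, offset, data) entries without interpreting the data."""
    for name, offset, data in entries:
        path = os.path.join(directory, name)
        with open(path, "r+b") as f:  # open existing file for in-place writes
            f.seek(offset)
            f.write(data)


# Create a scratch "data file" filled with the byte 'a'.
workdir = tempfile.mkdtemp()
datafile = os.path.join(workdir, "test.0")
with open(datafile, "wb") as f:
    f.write(b"a" * 8)

# Two hypothetical journal entries: raw bytes to place at raw offsets.
replay([("test.0", 2, b"bb"), ("test.0", 6, b"b")], workdir)

with open(datafile, "rb") as f:
    contents = f.read()
print(contents)  # b'aabbaaba'
```

In this toy version, replaying the same entries a second time leaves the file unchanged, since each entry simply overwrites the same bytes at the same offset.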

The write-ahead, sequential nature of the journal means that it fits nicely on a spinning disk, but its sequential access pattern will usually be at odds with the MMAP data access patterns (though not necessarily the access patterns of other engines). Hence it is sometimes a good idea to put the journal on its own disk or partition to reduce IO contention.
