EDIT: my original link to a 2011 presentation at MongoSF by Dwight is now dead, but there is a 2012 presentation by Ben Becker with similar content.
Just in case that stops working at some point too, I will give a quick summary of how the journal in the original MMAP storage engine worked, but it should be noted that with the advent of the pluggable storage engine model (MongoDB 3.0 and later), this now completely depends on the storage engine (and potentially the options) you are using - so please check.
Back to the original (MMAP) storage engine journal. At a very rudimentary level, the journal contains a series of queued operations and all operations are written into it as they happen - basically an append only sequential write to disk.
Once these operations have been applied and flushed to disk, then they are no longer needed in the journal and can be aged out. In this sense the journal basically acts like a circular buffer for write operations.
Internally, the operations in the journal are stored in "commit groups" - a logical group of write operations. Once an operation is in a complete commit group it can be considered to be synced to disk as part of the journal (and will satisfy the j:true
write concern for example). After an unclean shutdown, mongod
will attempt to apply all complete commit groups that have not previously been flushed to disk, incomplete commit groups will be discarded.
The operations in the journal are not what you will see in the oplog, rather they are a more simple set of files, offsets (disk locations essentially), and data to be written at the location. This allows for efficient replay of the data, and for a compact format for the journal, but will make the contents look like gibberish to most (as opposed to the aforementioned oplog which is basically readable as JSON documents). This basically answers one of the questions posed - it does not have any awareness of the database file's contents and the changes to be made to it, it is even more simple - it basically only knows to go to disk location X and write data Y, that's it.
The write-ahead, sequential nature of the journal means that it fits nicely on a spinning disk and the sequential access pattern will usually be at odds with the MMAP data access patterns (though not necessarily the access patterns of other engines). Hence it is sometimes a good idea to put the journal on its own disk or partition to reduce IO contention.