Why we still need innodb redo log when mysql binlog has been enabled?
Asked Answered
S

1

21

In my understanding, mysql binlog can fully function as InnoDB's redo log.

So, after the binlog is enabled, why does InnoDB have to write a redo log at the same time instead of just switching to use the binlog? Doesn't this significantly slow down database write performance?

In addition to simplifying design and implementation, is there any benefit to doing this?

AFAIK, to enable two logs at the same time as the ACID compliance is guaranteed, the following problems will occur:

  1. Each log record with the same meaning must be written twice separately.
  2. Flush two logs each time a transaction or transaction group commits.
  3. To ensure consistency between the two log files, a complex and inefficient way such as XA (2PC) is used.

Therefore, all other products seem to use only one set of logs (SQL Server called Transaction log, ORACLE called redo log, and PostgreSQL called WAL) to do all the relevant work. Is it only MySQL that must open two sets of logs at the same time to ensure both ACID compliance and strong consistent master-slave replication?

Is there a way to implement ACID compliance and strong consistent semi-synchronous replication while only one of them is enabled?

Silverplate answered 18/9, 2019 at 0:25 Comment(0)
C
25

This is an interesting topic. For a long time, I have been advocating the idea of merging the InnoDB write-ahead log and the binlog. The biggest motivation for that would be that the need to synchronize two separate logs would go away. But, I am afraid that this might not happen any time soon.

At MariaDB, we are taking some steps to reduce the fsync() overhead. The idea of MDEV-18959 Engine transaction recovery through persistent binlog is to guarantee that the binlog is never behind the InnoDB redo log, and by this, to allow a durable, crash-safe transaction commit with only one fsync() call, on the binlog file.

While the binlog implements logical logging, the InnoDB redo log implements physical logging (covering changes to persistent data pages that implement undo logs and index trees). As I explained in M|18 Deep Dive: InnoDB Transactions and Write Paths, a user transaction is divided into multiple mini-transactions, each of which can atomically modify multiple data pages.

The redo log is the ‘glue’ that makes changes to multiple data pages atomic. I think that the redo log is absolutely essential for implementing atomic changes of update-in-place data structures. Append-only data file structures, such as LSM trees, could be logs by themselves and would not necessarily need a separate log.

For an InnoDB table that contains secondary indexes, every single row operation is actually divided into multiple mini-transactions, operating on each index separately. Thus, the transaction layer requires more ‘glue’ that makes the indexes of a table consistent with each other. That ‘glue’ is provided by the undo log, which is implemented in persistent data pages.

InnoDB performs changes to the index pages upfront, and commit is a quick operation, merely changing the state of the transaction in the undo log header. But rollback is very expensive, because the undo log will have to be replayed backwards (and more redo log will be written to cover those index page changes).

In MariaDB Server, MyRocks is another transactional storage engine, which does the opposite: Buffer changes in memory until the very end, and at commit, apply them to the data files. This makes rollback very cheap, but the size of a transaction is limited by the amount of available memory. I have understood that MyRocks could be made to work in the way that you propose.

Catherin answered 18/9, 2019 at 8:43 Comment(6)
Hi Marko, thanks for your nice and comprehensive answer. But no matter which ACID compliant Storage Engine you use, binlog will cause extra writes, right? It seems that only by disabling binlog, and replicating directly through redo log can completely avoid these extra overheads? I found this: medium.com/@Alibaba_Cloud/… Are similar features likely to appear in the mainstream distributions such as mariadb?Silverplate
@Silverplate You are right. I have for years advocated the idea of writing all log events to a single file. In jira.mariadb.org/browse/MDEV-12353 I am working on a new InnoDB redo log format that allows ‘foreign’ log events to be added. The simpler code for parsing and applying log could also make it easier to implement physical replication (without binlog).Bedivere
I think this is a great improvement and I have to vote for it :-)Silverplate
Another way of using a single log file would be to store the binlog in an InnoDB table, instead of storing it in separate files. In this way, an InnoDB log checkpoint would not discard any binlog, just like it does not discard any undo log (which is stored in log-covered data pages). However, I do not have high hopes of that being implemented any time soon.Bedivere
@MarkoMäkelä Just wondering how are you planning to solve below 2 with single physical log file/WAL? 1. Replication now needs to work on physical addresses of things. While this should be okay as all nodes in cluster are generally homogenous and can work on simple file-offset based info, but will this break any heterogenous clusters? 2. Binlog also provides a straightforward, commit ordered change streams from client perspective. Now do you run the risk of exposing some low level details to client?Tonkin
@SauravPrakash I think that for replication, a logical format provides most flexibility, and I see no need to change that. Changing the storage location or layout of a log is independent of the format of log records (in this case, binlog events). I think that for any binlog storage layout to be useful, it must preserve the commit order, for example by assigning a suitable primary key. Deep inside InnoDB there is a total order imposed by the LSN (log sequence number), which is analogous to the Oracle SCN (system change number). That hardly needs to be exposed to any clients.Bedivere

© 2022 - 2024 — McMap. All rights reserved.