We use the Google Cloud SQL Database with the Replication and since three days our master - slave replication is constantly lagging behind and not catching up.
What can be the cause and what can we do?
What we did so far:
It started when we realize that our mysql replica in the Google Cloud was out of sync with the master. I looked into the log and noticed that the following error message happened unusually often 3 days ago.
"2018-05-03T08:31:07.851491Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5539ms. The settings might not be optimal. (flushed=289 and evicted=0, during the time.)"
Until now this message still appears in the error log. I googled and found this stackoverflow thread:
It is mentioned that the following setting can help with this problem innodb_lru_scan_depth to 256.
However, our database is in the Google Cloud, where we cannot customize .my.cnf. It's not possible to change the flag mentioned above.
About three days ago we executed a script, which deleted a lot of data in our database. I assume, that it created a lot of "dirty_pages" which are mentioned in the stackoverflow thread I linked above. To get more information I connected to the replica and issued the command
SHOW SLAVE STATUS\G;
display the slave status. Here are some points that seemed suspicious to me
*************************** 1. row ***************************
Slave_IO_State: Queueing master event to the relay log
Master_Host: IPAdress
Master_User: Master
Master_Port: Port
Connect_Retry: 60
Master_Log_File: mysql-bin.021462
Read_Master_Log_Pos: 62489170
Relay_Log_File: relay-log.069147
Relay_Log_Pos: 22557859
Relay_Master_Log_File: mysql-bin.020309
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 22557686
Relay_Log_Space: 121103459150
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File: master_server_ca.pem
Master_SSL_CA_Path: /mysql/datadir
Master_SSL_Cert: replica_cert.pem
Master_SSL_Cipher:
Master_SSL_Key: replica_pkey.pem
Seconds_Behind_Master: 196092
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 3852200383
Master_UUID: 48880dc9-4603-11e7-8ac2-42010af01028
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: System lock
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 48880dc9-4603-11e7-8ac2-42010af01028:414490186-432330581
Executed_Gtid_Set: 48880dc9-4603-11e7-8ac2-42010af01028:1-414777044,9cc92cb1-1a09-11e7-8bcc-42010af00a79:1-277822489
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.04 sec)
ERROR: No query specified
The replica is 196092 seconds behind the master and still rising.
As far as I understood:
- Master_Log_File: mysql-bin.021462 shows the actual bin file of the master
- Relay_Master_Log_File: mysql-bin.020309 shows the actual bin file of the replica.
I checked the status for a while and recognize Master_Log_File is rising faster than Relay_Master_Log_File does.
So does it mean that our Replica will never keep up with the Master?
It looks like the database is processing the Relay_Master_Log_File, but a Binlogfile takes a long time.
In addition, "SHOW PROCESSLIST" indicate that there is a system user that specifies a system lock.
+------+-------------+----------------------------+-----------------+---------+--------+----------------------------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+------+-------------+----------------------------+-----------------+---------+--------+----------------------------------------+------------------------------------------------------------------------------------------------------+
| 1 | system user | | NULL | Connect | 16023 | Queueing master event to the relay log | NULL |
| 2 | system user | | NULL | Connect | 196577 | System lock | NULL |
What I did next was to give the replica more CPU and memory. Because the utilization of the CPU was 100% throughout. With 8 cores and 30GB memory, the server is currently not fully utilized, but more than half of the resource is still being used. I assume this is the page_cleaner that writes the dirty_pages to the hard disk
I thought this would solve our problem, because the master has about four times of the power before increasing Hardware for Replica. But nothing has changed at the moment.
Does the page_cleaner now break down the dirty_pages and therefore requires a lot of performance so that the synchronization of the replica is very slow?