Google SQL replica is lagging behind master

We use the Google Cloud SQL Database with the Replication and since three days our master - slave replication is constantly lagging behind and not catching up.

What can be the cause and what can we do?

What we did so far:

It started when we realize that our mysql replica in the Google Cloud was out of sync with the master. I looked into the log and noticed that the following error message happened unusually often 3 days ago.

"2018-05-03T08:31:07.851491Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5539ms. The settings might not be optimal. (flushed=289 and evicted=0, during the time.)"

Until now this message still appears in the error log. I googled and found this stackoverflow thread:

How to solve mysql warning: "InnoDB: page_cleaner: 1000ms intended loop took XXX ms. The settings might not be optimal "?

It is mentioned that the following setting can help with this problem innodb_lru_scan_depth to 256.

However, our database is in the Google Cloud, where we cannot customize .my.cnf. It's not possible to change the flag mentioned above.

About three days ago we executed a script, which deleted a lot of data in our database. I assume, that it created a lot of "dirty_pages" which are mentioned in the stackoverflow thread I linked above. To get more information I connected to the replica and issued the command

SHOW SLAVE STATUS\G;

display the slave status. Here are some points that seemed suspicious to me

        *************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: IPAdress
                  Master_User: Master
                  Master_Port: Port
                Connect_Retry: 60
              Master_Log_File: mysql-bin.021462
          Read_Master_Log_Pos: 62489170
               Relay_Log_File: relay-log.069147
                Relay_Log_Pos: 22557859
        Relay_Master_Log_File: mysql-bin.020309
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 22557686
              Relay_Log_Space: 121103459150
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File: master_server_ca.pem
           Master_SSL_CA_Path: /mysql/datadir
              Master_SSL_Cert: replica_cert.pem
            Master_SSL_Cipher:
               Master_SSL_Key: replica_pkey.pem
        Seconds_Behind_Master: 196092
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 3852200383
                  Master_UUID: 48880dc9-4603-11e7-8ac2-42010af01028
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: System lock
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 48880dc9-4603-11e7-8ac2-42010af01028:414490186-432330581
            Executed_Gtid_Set: 48880dc9-4603-11e7-8ac2-42010af01028:1-414777044,9cc92cb1-1a09-11e7-8bcc-42010af00a79:1-277822489
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.04 sec)

ERROR: No query specified

The replica is 196092 seconds behind the master and still rising.

As far as I understood:

Master_Log_File: mysql-bin.021462 shows the actual bin file of the master
Relay_Master_Log_File: mysql-bin.020309 shows the actual bin file of the replica.

I checked the status for a while and recognize Master_Log_File is rising faster than Relay_Master_Log_File does.

So does it mean that our Replica will never keep up with the Master?

It looks like the database is processing the Relay_Master_Log_File, but a Binlogfile takes a long time.

In addition, "SHOW PROCESSLIST" indicate that there is a system user that specifies a system lock.

+------+-------------+----------------------------+-----------------+---------+--------+----------------------------------------+------------------------------------------------------------------------------------------------------+
| Id   | User        | Host                       | db              | Command | Time   | State                                  | Info                                                                                                 |
+------+-------------+----------------------------+-----------------+---------+--------+----------------------------------------+------------------------------------------------------------------------------------------------------+
|    1 | system user |                            | NULL            | Connect |  16023 | Queueing master event to the relay log | NULL                                                                                                 |
|    2 | system user |                            | NULL            | Connect | 196577 | System lock                            | NULL                                                                                                 |

What I did next was to give the replica more CPU and memory. Because the utilization of the CPU was 100% throughout. With 8 cores and 30GB memory, the server is currently not fully utilized, but more than half of the resource is still being used. I assume this is the page_cleaner that writes the dirty_pages to the hard disk

I thought this would solve our problem, because the master has about four times of the power before increasing Hardware for Replica. But nothing has changed at the moment.

Does the page_cleaner now break down the dirty_pages and therefore requires a lot of performance so that the synchronization of the replica is very slow?

Recommended topics

Hot tags