How do I fix a PostgreSQL 9.3 Slave that Cannot Keep Up with the Master?
We have a master-slave replication configuration as follows.

On the master:

postgresql.conf has replication configured as follows (commented line taken out for brevity):

max_wal_senders = 1            
wal_keep_segments = 8          

On the slave:

Same postgresql.conf as on the master. recovery.conf looks like this:

standby_mode = 'on'
primary_conninfo = 'host=master1 port=5432 user=replication password=replication'
trigger_file = '/tmp/postgresql.trigger.5432'

When this was initially set up, we performed some simple tests and confirmed that replication was working. However, when we did the initial data load, only some of the data made it to the slave.

The slave's log is now filled with messages that look like this:

< 2015-01-23 23:59:47.241 EST >LOG:  started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:47.241 EST >FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000F00000052 has already been removed

< 2015-01-23 23:59:52.259 EST >LOG:  started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:52.260 EST >FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000F00000052 has already been removed

< 2015-01-23 23:59:57.270 EST >LOG:  started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:57.270 EST >FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000F00000052 has already been removed

After some analysis and help on the #postgresql IRC channel, I've come to the conclusion that the slave cannot keep up with the master. My proposed solution is as follows.

On the master:

  1. Set max_wal_senders = 5
  2. Set wal_keep_segments = 4000. Yes, I know that is very high (at 16 MB per segment it is roughly 62 GB of pg_xlog), but I'd like to monitor the situation and see what happens (a monitoring sketch follows this list), and I have room on the master.
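
One way to watch from the master how far the slave trails (a sketch; it uses the 9.3 function and column names and assumes the standby is connected via a walsender):

    -- on master: bytes of WAL the standby still has to replay
    SELECT application_name, state,
           pg_xlog_location_diff(pg_current_xlog_location(),
                                 replay_location) AS replay_lag_bytes
    FROM pg_stat_replication;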

On the slave:

  1. Save the configuration files from the data directory (i.e. pg_hba.conf, pg_ident.conf, postgresql.conf, recovery.conf)
  2. Clear out the data directory (rm -rf /var/lib/pgsql/9.3/data/*). pg_basebackup refuses to run against a non-empty target directory.
  3. Run the following command (a variant that also streams the needed WAL is sketched after this list): pg_basebackup -h master -D /var/lib/pgsql/9.3/data --username=replication --password
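
If the base backup runs long enough for the master to recycle WAL in the meantime, the rebuilt slave can hit the same "has already been removed" error as soon as it starts. In 9.3, pg_basebackup can stream the WAL it needs alongside the copy and write recovery.conf for you; a sketch (--xlog-method=stream opens a second connection, so max_wal_senders must be at least 2):

    pg_basebackup -h master -D /var/lib/pgsql/9.3/data \
        --username=replication --password \
        --xlog-method=stream --write-recovery-conf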

Am I missing anything? Is there a better way to bring the slave up to date without having to reload all the data?

Any help is greatly appreciated.

Calliope answered 28/1, 2015 at 20:11 Comment(2)
You've really answered your own question -- set wal_keep_segments high enough to allow the slave to catch up after a burst of WAL updates. – Bathyal
What about recreating the slave -- is my proposed procedure valid? – Calliope

The two important options for dealing with the WAL for streaming replication:

  • wal_keep_segments should be set high enough to allow a slave to catch up after a reasonable lag (e.g. a burst of updates, the slave being offline for a while, etc.).

  • archive_mode enables WAL archiving which can be used to recover files older than wal_keep_segments provides. The slave servers simply need a method to retrieve the WAL segments. NFS is the simplest method, but anything from scp to http to tapes will work so long as it can be scripted.

    # on master
    archive_mode = on
    archive_command = 'cp %p /path_to/archive/%f' 
    
    # on slave
    restore_command = 'cp /path_to/archive/%f "%p"'
    

    When the slave can't pull a WAL segment directly from the master, it will attempt to use the restore_command to load it from the archive. You can configure the slave to automatically remove segments it no longer needs using the archive_cleanup_command setting.
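
    For example, a minimal cleanup hook using the contrib pg_archivecleanup tool (a sketch; the archive path mirrors the example above):

    # on slave, in recovery.conf
    # %r expands to the oldest WAL file still required to restart recovery,
    # so anything older than it can be safely deleted from the archive
    archive_cleanup_command = 'pg_archivecleanup /path_to/archive %r'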

If the slave comes to a situation where the next WAL segment it needs is missing from both the master and the archive, there will be no way to consistently recover the database. The only reasonable option then is to scrub the server and start again from a fresh pg_basebackup.

Bathyal answered 29/1, 2015 at 16:29 Comment(0)

You can configure replication slots to make PostgreSQL keep WAL segments on the master for every replica attached to a slot. (Note that replication slots were introduced in PostgreSQL 9.4, so they are not available on the 9.3 setup in the question.)

Read more at https://www.percona.com/blog/2018/11/30/postgresql-streaming-physical-replication-with-slots/

On the master server (max_replication_slots must be set high enough first; changing it requires a restart), run:

SELECT pg_create_physical_replication_slot('standby_slot'); 

On the slave server, add the following line to recovery.conf:

primary_slot_name = 'standby_slot'
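
To confirm the slot is in use and see how much WAL it is pinning on the master, you can query pg_replication_slots (a sketch using the pre-10 function names):

    -- on master: is a walsender attached, and how much WAL is retained?
    SELECT slot_name, active,
           pg_xlog_location_diff(pg_current_xlog_location(),
                                 restart_lsn) AS retained_wal_bytes
    FROM pg_replication_slots;

Keep in mind that a slot with no connected replica retains WAL indefinitely, so a dead standby can eventually fill the master's disk.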
Acidulant answered 4/12, 2019 at 13:7 Comment(0)

Actually, to recover you don't have to drop the whole DB and start from scratch. Since the master has the up-to-date data files, you can do the following to recover the slave and bring it back in sync:

# on the master: enter backup mode (this forces a checkpoint)
psql -c "select pg_start_backup('initial_backup');"
# copy the data files to the slave; quoting keeps the shell from expanding the glob
rsync -cva --inplace --exclude='*pg_xlog*' <data_dir> slave_IP_address:<data_dir>
# on the master: leave backup mode
psql -c "select pg_stop_backup();"

Note:
1. The slave has to be shut down first (service stop).
2. pg_start_backup puts the master into backup mode after forcing a checkpoint; it does not block writes.
3. The master continues serving queries, both reads and writes, throughout.
4. Bring the slave back up at the end of the steps.
I did this in prod, and it worked perfectly for me. The slave and master are in sync and there was no data loss.

Rouleau answered 20/9, 2018 at 21:40 Comment(0)

You will get that error if the wal_keep_segments setting is too low. When you choose a value for wal_keep_segments, consider how long the pg_basebackup takes.

Remember that a new 16 MB WAL segment is started whenever the previous one fills up, so the turnover rate depends entirely on write volume. If a segment is generated about every 5 minutes and the backup takes an hour, you need at least 12 segments saved; at 2 hours, 24, and so on. I would plan on roughly 12 segments for every hour of backup, plus a safety margin.
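
Rather than guessing the rate, you can measure WAL production directly (a sketch using the pre-10 function names; the two positions shown are hypothetical sample values):

    -- run once, wait a known interval, then run again
    SELECT pg_current_xlog_location();
    -- diff the two sampled positions; the result is in bytes
    SELECT pg_xlog_location_diff('F/53000000', 'F/52000000');
    -- divide by 16777216 (16 MB per segment) to get segments per interval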

Diploid answered 18/2, 2020 at 6:59 Comment(0)

As Bathyal suggested in the comments, this is a question of setting wal_keep_segments high enough to allow the slave to catch up.

Calliope answered 29/1, 2015 at 14:55 Comment(0)
