MongoDB replication timeout

I use MongoDB 3.4.3 and have three machines in one replica set. Let's call them server1, server2 and server3. server2 was in a constant rollback state, so we turned it off. server3 is in the RECOVERING state and tries to fetch the oplog from server1, but every attempt fails with an ExceededTimeLimit exception. Here is an extract from the server3 log:

2017-06-26T14:42:14.442+0300 I REPL     [replication-0] could not find member to sync from
2017-06-26T14:42:24.443+0300 I REPL     [rsBackgroundSync] sync source candidate: server1:27017
2017-06-26T14:42:24.444+0300 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to server1:27017
2017-06-26T14:42:24.455+0300 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to server1:27017
2017-06-26T14:42:54.459+0300 I REPL     [replication-0] Blacklisting server1:27017 due to required optime fetcher error: 'ExceededTimeLimit: Operation timed out, request was RemoteCommand 191739 -- server1:27017 db:local expDate:2017-06-26T14:42:54.459+0300 cmd:{ find: "oplog.rs", oplogReplay: true, filter: { ts: { $gte: Timestamp 1497975676000|310, $lte: Timestamp 1497975676000|310 } } }' for 10s until: 2017-06-26T14:43:04.459+0300. required optime: { ts: Timestamp 1497975676000|310, t: 20 }
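
As a rough check of how far behind server3 is, the required optime from the last log line can be converted to a wall-clock time. This is only a sketch: it assumes the first number in "Timestamp 1497975676000|310" is seconds since the epoch multiplied by 1000, which is how I read the 3.4 log format, and the variable names are my own.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the last server3 log line.
# Assumption: the log prints the Timestamp seconds multiplied by 1000.
required_optime_ms = 1497975676000

# Time of the log line itself (2017-06-26T14:42:54 +0300), milliseconds dropped.
log_time = datetime(2017, 6, 26, 14, 42, 54, tzinfo=timezone(timedelta(hours=3)))

# Convert the required optime to an aware UTC datetime.
required_time = datetime.fromtimestamp(required_optime_ms / 1000, tz=timezone.utc)

# How old the oplog entry that server3 is looking for was at that moment.
lag = log_time - required_time

print(required_time.isoformat())
print(lag)
```

Under that assumption, the entry server3 asks for dates from 2017-06-20 16:21:16 UTC, so the member was almost six days behind when this log line was written.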

These attempts to retrieve the oplog repeat indefinitely. According to db.currentOp(), there are a lot of long-running queries on server1 (the primary of the replica set) trying to read the oplog. These queries degrade the performance of server1, so my database works very slowly.

The current oplog size on server1 is 643 GB. I think its size is the reason why replication doesn't work. server2 had oplog timeout issues as well, so we turned it off temporarily. This situation has lasted for more than a week. I have more than 5 TB of data on the primary machine. How can I restore the replica set?

Update: Our servers have 64 GB of memory each. They are virtual machines.
