MongoDB replication timeout

I use MongoDB 3.4.3 and have three machines in one replica set; call them server1, server2 and server3. server2 was stuck in a constant rollback state, so we turned it off. server3 is in the recovering state and tries to fetch the oplog from server1, but every attempt fails with an ExceededTimeLimit error. Here is an extract from the server3 log:

2017-06-26T14:42:14.442+0300 I REPL     [replication-0] could not find member to sync from
2017-06-26T14:42:24.443+0300 I REPL     [rsBackgroundSync] sync source candidate: server1:27017
2017-06-26T14:42:24.444+0300 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to server1:27017
2017-06-26T14:42:24.455+0300 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to server1:27017
2017-06-26T14:42:54.459+0300 I REPL     [replication-0] Blacklisting server1:27017 due to required optime fetcher error: 'ExceededTimeLimit: Operation timed out, request was RemoteCommand 191739 -- server1:27017 db:local expDate:2017-06-26T14:42:54.459+0300 cmd:{ find: "oplog.rs", oplogReplay: true, filter: { ts: { $gte: Timestamp 1497975676000|310, $lte: Timestamp 1497975676000|310 } } }' for 10s until: 2017-06-26T14:43:04.459+0300. required optime: { ts: Timestamp 1497975676000|310, t: 20 }
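
As a rough aid to reading the log, the optime that server3 requires can be converted to a wall-clock time. This sketch assumes the first component of `Timestamp 1497975676000|310` is Unix time in milliseconds, which is how this log format appears to print it; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the log extract above.
required_optime_ms = 1497975676000  # first part of "Timestamp 1497975676000|310"
log_time = datetime(2017, 6, 26, 14, 42, 54, tzinfo=timezone(timedelta(hours=3)))

# Assumption: the first part of the timestamp is Unix time in milliseconds.
required_time = datetime.fromtimestamp(required_optime_ms // 1000, tz=timezone.utc)

# How far behind the log line the required oplog entry is.
lag = log_time - required_time

print(required_time.isoformat())  # 2017-06-20T16:21:16+00:00
print(lag)                        # 5 days, 19:21:38
```

If that assumption holds, server3 is asking server1 for an oplog entry written almost six days before the log line.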

These attempts to retrieve the oplog repeat indefinitely. According to db.currentOp(), there are a lot of long-running queries on server1 (the primary of the replica set) trying to read the oplog. These queries degrade the performance of server1, so my database runs very slowly.
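
A minimal sketch of how such queries could be picked out of the db.currentOp() output. It works on a plain dict shaped like that output (an `inprog` list whose entries carry `opid`, `ns` and `secs_running`); the function name, the 30-second threshold and the sample data are assumptions made for illustration, not values from the real servers. The threshold simply mirrors the roughly 30 seconds between connecting and timing out in the log extract.

```python
def long_running_oplog_ops(current_op, min_seconds=30):
    """Return (opid, secs_running) pairs for operations on local.oplog.rs
    that have been running for at least min_seconds, longest first."""
    hits = []
    for op in current_op.get("inprog", []):
        on_oplog = op.get("ns") == "local.oplog.rs"
        if on_oplog and op.get("secs_running", 0) >= min_seconds:
            hits.append((op["opid"], op["secs_running"]))
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Invented sample shaped like db.currentOp() output (hypothetical values).
sample = {
    "inprog": [
        {"opid": 101, "ns": "local.oplog.rs", "op": "query", "secs_running": 45},
        {"opid": 102, "ns": "app.users", "op": "query", "secs_running": 120},
        {"opid": 103, "ns": "local.oplog.rs", "op": "query", "secs_running": 310},
        {"opid": 104, "ns": "local.oplog.rs", "op": "getmore", "secs_running": 2},
    ]
}

for opid, secs in long_running_oplog_ops(sample):
    print(opid, secs)
```

The returned opid values are the ones that would be inspected further, or passed to db.killOp() if an operation has to be stopped.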

server1's oplog is currently 643 GB. I think its size is the reason replication doesn't work. server2 had oplog timeout issues as well, so we turned it off temporarily. This situation has lasted for more than a week. I have more than 5 TB of data on the primary machine. How can I restore the replica set?

Update: our servers have 64 GB of memory each. They are virtual machines.

Caucasia answered 28/6, 2017 at 9:25 Comment(0)

Can you afford downtime? It looks like your machine (server1) doesn't have enough memory. With 5 TB of data and an oplog that big, the amount of memory needed is in the hundreds of GB. I would not try to run that system as a single replica set. It is better suited to a sharded cluster of 3-5 shards (9-15 nodes in total, with a three-member replica set for every shard). A good rule is to always keep the data per node under 2 TB, and 1 TB is a good starting point if you can achieve that.
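
The 3-5 shard and 9-15 node figures follow from simple arithmetic, sketched below. The function name is hypothetical, and the 5 TB total is the rounded figure from the question (the real data set is somewhat larger).

```python
import math


def shards_needed(total_tb, per_node_tb, members_per_shard=3):
    """Return (shards, nodes) so that each shard holds at most per_node_tb."""
    shards = math.ceil(total_tb / per_node_tb)
    return shards, shards * members_per_shard


# Upper bound of 2 TB per node, then the 1 TB starting point.
for target_tb in (2.0, 1.0):
    shards, nodes = shards_needed(5.0, target_tb)
    print(f"{target_tb} TB per node: {shards} shards, {nodes} nodes")
```

With a 2 TB ceiling this gives 3 shards and 9 nodes; with the 1 TB target it gives 5 shards and 15 nodes.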

If you can afford downtime, you should shrink your oplog to a more reasonable size. You could start with 50 GB. Steps can be found here.

Suitor answered 29/6, 2017 at 7:35 Comment(2)
Our servers have 64 GB of memory each. They are virtual machines. Caucasia
We have a 10 x 1.1 TB cluster with 120 GB of memory on each node. We ran a "test" with 64 GB, but the performance was not good enough. Suitor
