Maintaining mirrored database of a MongoDB replica set

Asked 17/12, 2014 at 9:42 Answered 27/12, 2014 at 22:33

Solved mongodb synchronization mirror replay

We are running a 3-member MongoDB replica set in a production environment.

We need to maintain a clone of that replset, called the "mirror," for internal analytics. The mirror does not need to be real-time, but the more up-to-date it is the better (it may lag by 1 day at most).

What would be the most appropriate way to maintain such a mirrored database? (Note that the mirror can be either a 1-member replset or a standalone instance.)

FYI, we have tried 2 options, but neither was fast enough:

1. Oplog replaying. This took too long (~40 hours to replay the oplog from the replset's Primary).
2. Periodically restoring a snapshot of the production replset. The new volume (created from the snapshot) was very slow because it was not warmed up (we are using AWS EBS; warming up took ~12 hours).

Update #1: We also tried making the mirror a replset member, but we want the mirror to be separate from the replset, so this option does not satisfy the requirements.

Update #2: The reason why we do not want this mirror to be a replset member: We ran heavy queries on this mirror and made it run out of resource credits (disk IO, network IO, CPU) and the instance became temporarily unavailable. This changed the whole replset structure (because it lost one node). When the instance was available again, it changed the replset structure again (added one more node). These changes badly affected the replset.

Thank you.

Exemplum answered 17/12, 2014 at 9:42 Comment(6)

Use a delayed replica set member? (docs.mongodb.org/manual/tutorial/…) You can choose the delay. – Morceau 17/12, 2014 at 10:16

Thanks. Is there any option to separate the mirror? A delayed member still belongs to the replset. – Exemplum 17/12, 2014 at 10:30

Why don't you want it to belong to the replica set? – Morceau 17/12, 2014 at 10:51

Because we ran heavy queries on this mirror and made it run out of resource credits (disk IO, network IO, CPU) and the instance became temporarily unavailable. This changed the whole replset structure (because it lost one node). When the instance was available again, it changed the replset structure again (added one more node). These changes badly affected the replset. That is why I do not want the mirror to be member of replset. – Exemplum 18/12, 2014 at 2:12

Did you set its priority to 0, votes to 0, and hidden to true? That way it does not change the number of voting members, or the quorum/majority it takes to elect a new primary. – Leftover 26/12, 2014 at 23:29

I do not clearly remember but I should try it and evaluate. Thank you. – Exemplum 29/12, 2014 at 3:51

You could use a "hidden secondary" as explained here: http://docs.mongodb.org/manual/tutorial/configure-a-hidden-replica-set-member/

We use those in a sharded replica environment (4 shards, multiple secondaries per shard) to do our backups. We shut down the hidden secondaries, take snapshots of the file system, and start the machines again afterwards. We have never had problems on the production cluster during or after backups. Depending on your needs, you can set a custom delay so the replica is either live or lags by the configured time.

Update: To explain why I'm so sure this will work: our cluster does (at MongoDB scale) really heavy lifting, with huge M/R jobs, high insert, update and query rates, and a total DB size of around 10TB, all on fairly small EC2 instances. We can shut down our backup secondaries without any problems, whatever state the production cluster is in. We have been doing backups more than 5 times a day for more than a year now and have run several tests with this architecture, and we never saw any problems on the production cluster. As our application is really latency sensitive, we would see a huge impact on our system if backups affected latency in any way.
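The hidden-secondary setup described above, together with the priority/votes settings suggested in the comments on the question, can be sketched as a small change to the replica set configuration document. This is only an illustration: the host names, the set name `rs0`, and the helper `make_hidden_member` are made up for the example. The delay field is called `slaveDelay` in the MongoDB versions current at the time of this thread and `secondaryDelaySecs` from 5.0 on, so it is passed in as a parameter.

```python
import copy


def make_hidden_member(config, host, delay_secs=0, delay_field="slaveDelay"):
    """Return a copy of `config` in which `host` is a hidden, non-voting,
    priority-0 member. The input config is left untouched."""
    new_config = copy.deepcopy(config)
    for member in new_config["members"]:
        if member["host"] == host:
            member["priority"] = 0   # can never be elected primary
            member["hidden"] = True  # not advertised to client applications
            member["votes"] = 0      # does not count toward the election majority
            if delay_secs:
                member[delay_field] = delay_secs  # optional replication delay
            break
    else:
        raise ValueError(f"{host} is not a member of this replica set")
    # A reconfig has to carry a higher version than the current config.
    new_config["version"] += 1
    return new_config


# Hypothetical current config (same shape as a replica set config document).
current = {
    "_id": "rs0",
    "version": 4,
    "members": [
        {"_id": 0, "host": "db1.example.net:27017"},
        {"_id": 1, "host": "db2.example.net:27017"},
        {"_id": 2, "host": "db3.example.net:27017"},
        {"_id": 3, "host": "mirror.example.net:27017"},
    ],
}

proposed = make_hidden_member(current, "mirror.example.net:27017", delay_secs=3600)
print(proposed["members"][3])
```

With a driver such as pymongo, the resulting document would then be sent to the primary, for example with `client.admin.command("replSetReconfig", proposed)`. Whether `votes: 0` resolves the slowdown reported in the question is something to test rather than assume.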

Ss answered 27/12, 2014 at 22:33 Comment(2)

I've looked for the best method for quite a while too. This is it. The fact that it has a 0-priority also means it cannot participate in votes and f* your primary if it ever needs to restart. I have a smaller company and can only afford 2 nodes, a master and a backup-slave. Before I found this it was nerve-wracking to restart the master and see it become secondary with slaveOk = false, bye-bye site. With mongodump you don't even have to shut it down, just make sure to capture the oplog as well. – Thumbnail 27/12, 2014 at 22:38

We used to have the replset you suggest, but when the hidden member became unavailable (due to resource credits running out) the whole replset slowed down significantly. FYI, we had "hidden:true" and "priority:0". I am not sure whether that was because we lacked "votes:0" (see @asya-kamsky 's comment under my original question). – Exemplum 29/12, 2014 at 9:52

You can set up MongoDB to direct reads to specific nodes using read preference tag sets: http://docs.mongodb.org/manual/core/read-preference/#tag-sets, http://docs.mongodb.org/manual/tutorial/configure-replica-set-tag-sets/. Using tags is not complicated, and it is quite a good alternative to the "nearest" read preference.

So you could make this "mirror" a secondary member of the replica set, give the production secondaries a "production" tag that your production clients read from, and give the "mirror" instance its own "mirror" tag that is used only when you need to read from that instance. That way the mirror instance is a full member of the replica set and is constantly updated. Making the "mirror" instance a delayed replica set member also makes sense in this case.
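As a rough sketch of the tagging idea above: each member carries a `tags` document in the replica set config, and analytics clients ask for secondaries with the mirror tag through the standard `readPreference` and `readPreferenceTags` connection string options. The tag key `use`, the host names, the set name `rs0`, and the helper `tagged_read_uri` are assumptions made for the example, not something from the original answer.

```python
# Hypothetical member entries as they would appear in the replica set config.
members = [
    {"_id": 1, "host": "db2.example.net:27017", "tags": {"use": "production"}},
    {"_id": 2, "host": "db3.example.net:27017", "tags": {"use": "production"}},
    {"_id": 3, "host": "mirror.example.net:27017", "tags": {"use": "mirror"},
     "priority": 0},
]


def tagged_read_uri(hosts, replica_set, tag_key, tag_value):
    """Build a connection string that reads only from secondaries
    carrying the tag `tag_key:tag_value`."""
    options = [
        f"replicaSet={replica_set}",
        "readPreference=secondary",
        f"readPreferenceTags={tag_key}:{tag_value}",
    ]
    return "mongodb://" + ",".join(hosts) + "/?" + "&".join(options)


seed_hosts = [m["host"] for m in members]
analytics_uri = tagged_read_uri(seed_hosts, "rs0", "use", "mirror")
production_uri = tagged_read_uri(seed_hosts, "rs0", "use", "production")
print(analytics_uri)
```

Only the connection string differs between the production and analytics clients, which keeps the application change small.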

However there is a little thing to consider:

When the read preference includes a tag set, the client attempts to find secondary members that match the specified tag set and directs reads to a random secondary from among the nearest group. If no secondaries have matching tags, the read operation produces an error. [1]
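The behaviour in that quote can be pictured with a toy model: filter the secondaries by the requested tags, pick one at random, and fail if nothing matches. This is not driver code. It ignores the latency window ("nearest group") that real drivers apply, and the names `pick_secondary`, `use`, and the hosts are invented for illustration.

```python
import random


def pick_secondary(secondaries, tag_set, rng=random):
    """Toy model of tag-based selection: return a random secondary whose
    tags include every key/value pair in `tag_set`, or raise if none do."""
    matching = [
        member for member in secondaries
        if all(member.get("tags", {}).get(key) == value
               for key, value in tag_set.items())
    ]
    if not matching:
        # Mirrors the "read operation produces an error" case from the docs.
        raise LookupError(f"no secondary matches tag set {tag_set}")
    return rng.choice(matching)


secondaries = [
    {"host": "db2.example.net:27017", "tags": {"use": "production"}},
    {"host": "db3.example.net:27017", "tags": {"use": "production"}},
    {"host": "mirror.example.net:27017", "tags": {"use": "mirror"}},
]

chosen = pick_secondary(secondaries, {"use": "mirror"})
print(chosen["host"])
```

The practical consequence is that analytics queries fail with an error while the mirror is down, instead of silently falling back to a production secondary.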

In any case, this is what I would try in your place.

P.S. One important thing about gathering statistics and analytics on your MongoDB collections: MongoDB experts in their courses recommend storing such statistics (counts etc.) during write operations. That is, if you have a users collection and need to count posts (or keep some other statistic) for each user, a series of writes that apply $inc to counter fields will spread the load on the database, and overall performance will be better than if you run complicated aggregation queries each time you need a count, an average, or similar statistics from the db.
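A minimal sketch of that counter pattern, assuming a hypothetical `users` collection with a `post_count` field: every time a post is written, the application also issues a small `$inc` update for the author. The `apply_inc` function below is only an in-memory stand-in so the example runs without a database.

```python
def post_created_update(user_id):
    """Filter and update documents that bump a user's post counter by one."""
    return {"_id": user_id}, {"$inc": {"post_count": 1}}


def apply_inc(document, update):
    """In-memory stand-in for $inc, for illustration only: a missing
    field is treated as 0 before the amount is added."""
    for field, amount in update["$inc"].items():
        document[field] = document.get(field, 0) + amount
    return document


user = {"_id": 42, "name": "alice"}
for _ in range(3):  # three posts written by the same user
    _, update = post_created_update(user["_id"])
    apply_inc(user, update)

print(user["post_count"])  # 3
```

Against a real database the same pair would be passed to the driver, for example `db.users.update_one(*post_created_update(42))` with pymongo, and the analytics side then reads `post_count` directly instead of aggregating over all posts.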

Villeneuve answered 26/12, 2014 at 13:53 Comment(2)

Thank you. This sounds like we need to make changes to our applications for the usage of tags. Is that correct? That is something we do not really want to do. – Exemplum 29/12, 2014 at 10:1

Yes, unfortunately you do. However, those changes are mega-easy to apply. In Java, for example, you just set your own read preference and that's it: docs.mongodb.org/ecosystem/drivers/java-replica-set-semantics/…. I hope it is just as easy in other languages. This is also a widely accepted best practice for spreading data among remote data centers, for backups, and for separating data-processing workloads. – Villeneuve 29/12, 2014 at 12:58
