Stopping and restarting MongoDB replica set nodes leaves the primary in RECOVERING status

When I stop the nodes of my replica set and start them up again, the primary node goes into the "RECOVERING" state.

I have a replica set created and running without authorization. In order to use authorization, I added users with db.createUser(...) and enabled authorization in the configuration file:

security:
   authorization: "enabled"
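
For reference, a minimal sketch of the kind of user creation referred to above; the user name, password, and roles are placeholders, not my actual values:

use admin
db.createUser({
    user: "siteRootAdmin",                      // placeholder user name
    pwd: "changeMe",                            // placeholder password
    roles: [ { role: "root", db: "admin" } ]
})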

Before stopping the replica set (and even after restarting the cluster without adding the security params), rs.status() shows:

{
        "set" : "REPLICASET",
        "date" : ISODate("2016-09-08T09:57:50.335Z"),
        "myState" : 1,
        "term" : NumberLong(7),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "192.168.1.167:27017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 301,
                        "optime" : {
                                "ts" : Timestamp(1473328390, 2),
                                "t" : NumberLong(7)
                        },
                        "optimeDate" : ISODate("2016-09-08T09:53:10Z"),
                        "electionTime" : Timestamp(1473328390, 1),
                        "electionDate" : ISODate("2016-09-08T09:53:10Z"),
                        "configVersion" : 1,
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "192.168.1.168:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 295,
                        "optime" : {
                                "ts" : Timestamp(1473328390, 2),
                                "t" : NumberLong(7)
                        },
                        "optimeDate" : ISODate("2016-09-08T09:53:10Z"),
                        "lastHeartbeat" : ISODate("2016-09-08T09:57:48.679Z"),
                        "lastHeartbeatRecv" : ISODate("2016-09-08T09:57:49.676Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "192.168.1.167:27017",
                        "configVersion" : 1
                },
                {
                        "_id" : 2,
                        "name" : "192.168.1.169:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 295,
                        "optime" : {
                                "ts" : Timestamp(1473328390, 2),
                                "t" : NumberLong(7)
                        },
                        "optimeDate" : ISODate("2016-09-08T09:53:10Z"),
                        "lastHeartbeat" : ISODate("2016-09-08T09:57:48.680Z"),
                        "lastHeartbeatRecv" : ISODate("2016-09-08T09:57:49.054Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "192.168.1.168:27017",
                        "configVersion" : 1
                }
        ],
        "ok" : 1
}

In order to start using this configuration, I have stopped each node as follows:

[root@n--- etc]# mongo --port 27017 --eval 'db.adminCommand("shutdown")'
MongoDB shell version: 3.2.9
connecting to: 127.0.0.1:27017/test
2016-09-02T14:26:15.784+0200 W NETWORK  [thread1] Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused
2016-09-02T14:26:15.785+0200 E QUERY    [thread1] Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed :
connect@src/mongo/shell/mongo.js:231:14
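
(For reference, once authorization is enabled a clean shutdown generally has to be authenticated; a hedged sketch with placeholder credentials, not my exact command:)

mongo admin --port 27017 -u "admin" -p "password" --eval 'db.shutdownServer()'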

After this shutdown, I have confirmed that the process does not exist by checking the output from ps -ax | grep mongo.

But when I start the nodes again and log in with my credentials, rs.status() now shows:

{
        "set" : "REPLICASET",
        "date" : ISODate("2016-09-08T13:19:12.963Z"),
        "myState" : 3,
        "term" : NumberLong(7),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "192.168.1.167:27017",
                        "health" : 1,
                        "state" : 3,
                        "stateStr" : "RECOVERING",
                        "uptime" : 42,
                        "optime" : {
                                "ts" : Timestamp(1473340490, 6),
                                "t" : NumberLong(7)
                        },
                        "optimeDate" : ISODate("2016-09-08T13:14:50Z"),
                        "infoMessage" : "could not find member to sync from",
                        "configVersion" : 1,
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "192.168.1.168:27017",
                        "health" : 0,
                        "state" : 6,
                        "stateStr" : "(not reachable/healthy)",
                        "uptime" : 0,
                        "optime" : {
                                "ts" : Timestamp(0, 0),
                                "t" : NumberLong(-1)
                        },
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "lastHeartbeat" : ISODate("2016-09-08T13:19:10.553Z"),
                        "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
                        "pingMs" : NumberLong(0),
                        "authenticated" : false,
                        "configVersion" : -1
                },
                {
                        "_id" : 2,
                        "name" : "192.168.1.169:27017",
                        "health" : 0,
                        "state" : 6,
                        "stateStr" : "(not reachable/healthy)",
                        "uptime" : 0,
                        "optime" : {
                                "ts" : Timestamp(0, 0),
                                "t" : NumberLong(-1)
                        },
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "lastHeartbeat" : ISODate("2016-09-08T13:19:10.552Z"),
                        "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
                        "pingMs" : NumberLong(0),
                        "authenticated" : false,
                        "configVersion" : -1
                }
        ],
        "ok" : 1
}

Why? Perhaps the shutdown command is not a good way to stop mongod; however, I also tested using 'kill <pid>', and the restart ends up in the same state.

In this state I don't know how to repair the cluster, so I have started over (removing the dbPath files and reconfiguring the replica set); I also tried '--repair', but it did not work.

Info about my system:

  • Mongo version: 3.2
  • I start the process as root; perhaps it should run as the 'mongod' user?
  • This is my start command: mongod --config /etc/mongod.conf
  • keyFile configuration does not work; if I add "--keyFile /path/to/file", mongod only prints
    "about to fork child process, waiting until server is ready for connections." and never becomes available. The file has all permissions set, but mongod still cannot use the keyFile (see the hedged config sketch after this list).
  • An example of the "net.bindIp" configuration, from mongod.conf on one machine:

    net:
      port: 27017
      bindIp: 127.0.0.1,192.168.1.167
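
A sketch of the mongod.conf security/replication sections I am aiming for (referred to in the keyFile bullet above); the key file path is an illustrative assumption:

security:
  authorization: "enabled"
  keyFile: /etc/mongodb-keyfile        # illustrative path; must be chmod 400 and readable by the mongod user
replication:
  replSetName: "REPLICASET"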
    
Steffi answered 31/8, 2016 at 8:16 Comment(16)
What did you do after enabling authentication? How do your replica set members authenticate their membership to the replica set? After enabling authentication, you can't connect to an instance without appropriate credentials, unless you're connecting from localhost. - Nebula
Did you try this: docs.mongodb.com/manual/core/security-internal-authentication ? - Nebula
After enabling authentication in all config files I start the replica set (with mongod --config /etc/mongod.conf) and then I log in with the credentials I created before shutting down the cluster. Keyfile authentication is optional; that is not my issue. I only want access by user/password. - Steffi
You should add some extra configuration, because not only do clients need to be able to authenticate with the replica set, but replica set nodes also need to be able to authenticate with each other. So each replica set node authenticates itself with the others as a special internal user with enough privileges. - Nebula
For example, you can add a key file to each replica set member and put the password of the admin user in it. Then start the mongod instance with the --keyFile /path/to/keyfile argument. - Nebula
If I add --keyFile /path/to/keyfile, mongod does not start; it just shows the message "about to fork child process, waiting until server is ready for connections." I have to start it without this property. - Steffi
According to your log, the "stop" command fails, so we cannot confirm that anything you said you did actually happened. It's impossible to answer your question without hard evidence. Follow the docs and return if/when you have something more concrete to show. - Esemplastic
Then what is your suggestion regarding the 'shutdown' command? After executing the shutdown command, the mongod process stops. - Steffi
Can you show the full status of your replica set? You get it with the command rs.status(). - Double
Youe - the replica set status you have posted says everything is healthy. Does that mean it has been repaired and is now fine? Was it simply that the nodes took a while to go through the RECOVERING state to reach full health? - Double
Before stopping the cluster rs.status() is OK, but when I stop/start the mongod services rs.status() shows a different state. - Steffi
That last status makes it look like only one node is running at all. In that case, there aren't enough nodes to hold an election, so the replica set will not be available. What happens when you reboot a second node? Do they manage to contact each other, hold an election and re-establish the replica set in full health? - Double
The second node is active in that state, but it is not reachable. Maybe when I relaunch mongod (with the security params) they cannot communicate with each other. - Steffi
So if you restart the nodes without changing any configuration, do they reinstate the replica set successfully? - Double
Yes, if there are no configuration changes I can start them and all the states are OK. I have written the solution below. - Steffi
Sounds like there never was a problem with your replica set then; it was a problem with your authentication after all. - Double

Note: This solution is Windows-specific but can be ported to *nix-based systems easily.

You'll need to take these steps in sequence. First of all, start your mongod instances.

start "29001" mongod --dbpath "C:\data\db\r1" --port 29001
start "29002" mongod --dbpath "C:\data\db\r2" --port 29002
start "29003" mongod --dbpath "C:\data\db\r3" --port 29003 

Connect to each node with the mongo shell and create an administrator user. I prefer creating a superuser.

> use admin
> db.createUser({user: "root", pwd: "123456", roles:["root"]})

You may create other users as deemed necessary.

Create a key file. See the documentation for valid key file contents.

Note: On *nix-based systems, set the key file's permissions to 400 (chmod 400).

In my case, I created the key file as follows:

echo mysecret==key > C:\data\key\key.txt
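
(On a *nix-based system, an equivalent way to create and protect the key file, sketched here as an assumption rather than the commands actually used in this answer:)

openssl rand -base64 756 > /etc/mongodb-keyfile    # generate random key content
chmod 400 /etc/mongodb-keyfile                     # mongod rejects key files with open permissions
chown mongod:mongod /etc/mongodb-keyfile           # only needed if mongod runs as the 'mongod' user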

Now restart your MongoDB servers with the --keyFile and --replSet flags.

start "29001" mongod --dbpath "C:\data\db\r1" --port 29001 --replSet "rs1" --keyFile C:\data\key\key.txt
start "29002" mongod --dbpath "C:\data\db\r2" --port 29002 --replSet "rs1" --keyFile C:\data\key\key.txt
start "29003" mongod --dbpath "C:\data\db\r3" --port 29003 --replSet "rs1" --keyFile C:\data\key\key.txt

Once all the mongod instances are up and running, connect to any one of them with authentication.

mongo --port 29001 -u "root" -p "123456" --authenticationDatabase "admin"
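
(Equivalently, you can connect without credentials first and authenticate from inside the shell; db.auth() returns 1 on success:)

mongo --port 29001
> use admin
> db.auth("root", "123456")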

Initiate the replica set:

> use admin
> rs.initiate()
rs1:PRIMARY> rs.add("localhost:29002")
{ "ok" : 1 }
rs1:PRIMARY> rs.add("localhost:29003")
{ "ok" : 1 }

Note: You may need to replace localhost with a machine name or IP address.
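
To confirm that the members can authenticate to each other and settle into PRIMARY/SECONDARY states, rs.status() can be checked again, for example:

rs1:PRIMARY> rs.status().members.forEach(function (m) { print(m.name + " : " + m.stateStr) })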

Swarey answered 4/9, 2016 at 15:26 Comment(0)

I finally resolved the problem: for a replica set cluster, a keyFile is MANDATORY so that all the nodes can communicate with each other. When I specified the keyFile, it failed because mongod.log showed:

I ACCESS   [main] permissions on /etc/keyfile are too open

The keyfile must have its permissions set to 400. Thanks @Saleem.
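
(Concretely, the fix amounts to something like this; the chown line is only an assumption for the case where mongod runs as the 'mongod' user:)

chmod 400 /etc/keyfile
chown mongod:mongod /etc/keyfile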

When people said "You can add a keyfile", I took it as an optional parameter, but it is mandatory.

Steffi answered 8/9, 2016 at 15:8 Comment(0)

Nodes should be shut down one at a time, so that another secondary member can be elected primary. A restarted node will be in the RECOVERING state while it syncs from the other members. Shutting down nodes one by one means you will not need to re-add them.
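
A sketch of that rolling restart in the mongo shell, assuming the usual pattern of stepping the primary down last (illustrative commands, not necessarily the exact procedure meant here):

// restart each secondary one at a time, waiting until it shows SECONDARY
// in rs.status() before moving on; then, connected to the primary:
rs.stepDown()          // hand the primary role over to a healthy secondary
db.shutdownServer()    // then stop this node and restart it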

Ratio answered 6/9, 2016 at 7:59 Comment(0)
