I have a 5-node SolrCloud cluster (Solr 7.0) with an external 3-node ZooKeeper ensemble. There is one collection, "production", split into 5 shards with a replication factor of 5. See the screenshot below:
shard5 struggled for a long time to elect a new leader, and other cores were logging the following error:
azsolr1 solr: 2018-08-28 19:32:43.575 ERROR (qtp1124317168-9304) [c:production s:shard2 r:core_node9 x:production_shard2_replica_n4] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: production slice: shard5
After restarting all nodes one by one (I even restarted the ZooKeeper nodes), I still could not get the only active replica (azsolr1) elected as the leader. I then unloaded the 4 replicas in the 'down' state using the CoreAdmin API UNLOAD command, which caused those replicas to disappear completely.
With that setup, trying to force a leader for the shard using the Collections API FORCELEADER action does nothing. I also tried this before unloading the cores.
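For reference, a FORCELEADER request of the kind described above, plus a CLUSTERSTATUS request to check the shard afterwards, can be sketched as follows. This only builds the URLs and sends nothing. The host name and port in BASE and the helper name collections_api_url are assumptions for illustration; the collection and shard names are the ones from this question.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Solr Collections API URL (hypothetical helper, no network I/O)."""
    query = {"action": action, **params}
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(query)}"


# Assumed base URL: 8983 is Solr's default port, adjust to the real node.
BASE = "http://azsolr1:8983/solr"

# Ask Solr to force a leader election for shard5 of "production".
force_leader_url = collections_api_url(
    BASE, "FORCELEADER", collection="production", shard="shard5"
)

# Inspect the state of the same shard (leader flag, replica states).
status_url = collections_api_url(
    BASE, "CLUSTERSTATUS", collection="production", shard="shard5"
)

print(force_leader_url)
print(status_url)
```

The URLs can then be fetched with any HTTP client (curl, a browser, or urllib.request.urlopen).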
Here is the current status:
Why can't Solr just elect the only active replica for shard 5 as the leader? Isn't this obvious, especially after forcing the leader on the shard?
Assuming a leader does get elected somehow, do I recreate the replicas that I deleted using the Collections API ADDREPLICA action? If so, should I reuse the instanceDir and dataDir of the deleted replicas, or should I just let them replicate from scratch?
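To make the two options concrete, here is a minimal sketch of the two ADDREPLICA request shapes in question: one that lets Solr choose the directories and replicate from the leader, and one that passes instanceDir and dataDir explicitly. It only builds URLs. The base URL, the node name azsolr2:8983_solr, and both directory paths are hypothetical placeholders, and whether pointing at an old dataDir is actually safe is exactly what I am unsure about.

```python
from urllib.parse import urlencode


def add_replica_url(base_url, collection, shard,
                    node=None, instance_dir=None, data_dir=None):
    """Build an ADDREPLICA URL (hypothetical helper, no network I/O)."""
    params = {"action": "ADDREPLICA", "collection": collection, "shard": shard}
    optional = {"node": node, "instanceDir": instance_dir, "dataDir": data_dir}
    # Only send the optional parameters that were actually given.
    params.update({k: v for k, v in optional.items() if v is not None})
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


BASE = "http://azsolr1:8983/solr"  # assumed host and default port

# Option 1: new replica on a given node, index replicated from scratch.
scratch_url = add_replica_url(
    BASE, "production", "shard5", node="azsolr2:8983_solr"
)

# Option 2: same, but pointing at pre-existing directories (placeholder paths).
reuse_url = add_replica_url(
    BASE, "production", "shard5", node="azsolr2:8983_solr",
    instance_dir="/path/to/old/instanceDir",
    data_dir="/path/to/old/dataDir",
)

print(scratch_url)
print(reuse_url)
```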