We've discovered some duplicate documents in one of our Elasticsearch indices and we haven't been able to work out the cause. There are two copies of each affected document, and they have exactly the same _id, _type and _uid fields.
A GET request to /index-name/document-type/document-id returns just one copy, but searching for the document with a query like this returns two results, which is quite surprising:
POST /index-name/document-type/_search
{
    "filter": {
        "term": {
            "_id": "document-id"
        }
    }
}
Aggregating on the _uid field also identifies the duplicate documents:
POST /index-name/_search
{
    "size": 0,
    "aggs": {
        "duplicates": {
            "terms": {
                "field": "_uid",
                "min_doc_count": 2
            }
        }
    }
}
The duplicates are all on different shards. For example, a document might have one copy on primary shard 0 and one copy on primary shard 1. We've verified this by running the aggregate query above on each shard in turn using the preference parameter: it does not find any duplicates within a single shard.
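For reference, here is a minimal Python sketch of the cross-check described above. It assumes the _uid values have already been collected per shard (for example by searching with preference=_shards:0, then _shards:1, and so on) into a plain dictionary; the function name and the input shape are hypothetical, not part of any Elasticsearch client.

```python
from collections import defaultdict


def find_cross_shard_duplicates(uids_by_shard):
    """Return {uid: [shards]} for every _uid seen on more than one shard.

    uids_by_shard is a hypothetical input: shard number -> iterable of
    _uid strings gathered from one search per shard.
    """
    shards_by_uid = defaultdict(set)
    for shard, uids in uids_by_shard.items():
        for uid in uids:
            shards_by_uid[uid].add(shard)
    # Keep only the uids that live on at least two different shards.
    return {
        uid: sorted(shards)
        for uid, shards in shards_by_uid.items()
        if len(shards) > 1
    }


# Illustrative data only: one document present on shards 0 and 1.
example = {
    0: ["document-type#document-id", "document-type#other-id"],
    1: ["document-type#document-id"],
    2: ["document-type#third-id"],
}
duplicates = find_cross_shard_duplicates(example)
```

With the illustrative data, the duplicates dictionary maps the shared _uid to the list of shards holding a copy, which is the pattern we see in the broken index.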
Our best guess is that something has gone wrong with the routing, but we don't understand how the copies could have been routed to different shards. According to the routing documentation, the default routing is based on the document ID, and should consistently route a document to the same shard.
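To check which shard a given ID should land on, the documented formula is shard = hash(routing) % number_of_primary_shards, with the document ID used as the routing value by default. The Python sketch below is a rough approximation under stated assumptions, not a verified port: it assumes the index was created on 2.x, that the hash is 32-bit Murmur3 with seed 0 over the UTF-16 little-endian bytes of the routing string, and that the modulo is a floor modulo on the signed 32-bit result. Indices created before 2.0 are believed to use a different legacy hash, in which case this would not apply. The function names are our own.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data, seed=0):
    """32-bit Murmur3 (x86 variant) of a bytes object, as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    rounded = len(data) & ~3
    # Body: process the input four bytes at a time, little-endian.
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (_rotl32((k * c1) & MASK32, 15) * c2) & MASK32
        h = (_rotl32(h ^ k, 13) * 5 + 0xE6546B64) & MASK32
    # Tail: the remaining one to three bytes, if any.
    tail = data[rounded:]
    if tail:
        k = int.from_bytes(tail, "little")
        h ^= (_rotl32((k * c1) & MASK32, 15) * c2) & MASK32
    # Finalization mix.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def expected_shard(routing, number_of_primary_shards):
    """Assumed default routing: floor-mod of the signed 32-bit hash."""
    h = murmur3_32(routing.encode("utf-16-le"))
    if h >= 2 ** 31:  # reinterpret as a signed 32-bit integer
        h -= 2 ** 32
    return h % number_of_primary_shards  # Python's % is already a floor mod


shard = expected_shard("document-id", 3)
```

If the assumptions hold, comparing expected_shard for a duplicated ID with the two shards that actually hold it would show which copy was misrouted.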
We are not using custom routing parameters that would override the default routing. We've double-checked this by making sure that the duplicate documents don't have a _routing field.
We also don't define any parent/child relationships, which would affect routing as well. (See this question in the Elasticsearch forum, for example, which has the same symptoms as our problem. We don't think the cause is the same because we're not setting any document parents.)
We fixed the immediate problem by reindexing into a new index, which squashed the duplicate documents. We still have the old index around for debugging.
We haven't found a way of reproducing the problem. The new index is indexing documents correctly, and we've tried rerunning an overnight processing job that also updates documents, but it hasn't created any more duplicates.
The cluster has 3 nodes, 3 primary shards and 1 replica (i.e. 3 replica shards). minimum_master_nodes is set to 2, which should prevent the split-brain issue. We're running Elasticsearch 2.4 (which we know is old - we're planning to upgrade soon).
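As a sanity check on that setting, the usual quorum rule for minimum_master_nodes is (master-eligible nodes / 2) + 1, using integer division. A tiny Python sketch, assuming all 3 of our nodes are master-eligible (the function name is just for illustration):

```python
def quorum(master_eligible_nodes):
    """Smallest majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


# Assumption: all 3 nodes in our cluster are master-eligible.
recommended = quorum(3)
```

For 3 master-eligible nodes this gives 2, which matches the value we have configured.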
Does anyone know what might cause these duplicates? Do you have any suggestions for ways to debug it?