I've got a mid-size Elasticsearch index (1.46TB, ~1e8 docs). It's running on 4 servers, each with 64GB of RAM split evenly between Elasticsearch and the OS (for caching).
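In case the split matters: each node's heap is set with the standard environment variable (a sketch of my setup, assuming the stock startup script picks up ES_HEAP_SIZE):

export ES_HEAP_SIZE=32g    # half of the 64GB box; the rest is left to the OS page cache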
I want to try out the new "significant terms" aggregation, so I fired off the following query:
{
  "query": {
    "ids": {
      "type": "document",
      "values": [
        "xCN4T1ABZRSj6lsB3p2IMTffv9-4ztzn1R11P_NwTTc"
      ]
    }
  },
  "aggregations": {
    "Keywords": {
      "significant_terms": {
        "field": "Body"
      }
    }
  },
  "size": 0
}
This should compare the body of the specified document with the rest of the index and find terms that are significant to the document but not common across the index.
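For completeness, this is how I'm submitting it (the index name is a placeholder for my real one, with the query above saved as query.json):

curl -XPOST 'http://localhost:9200/myindex/_search' -d @query.json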
Unfortunately, this invariably results in a
ElasticsearchException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: UncheckedExecutionException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: CircuitBreakingException[Data too large, data would be larger than limit of [25741911654] bytes];
after a minute or two, which seems to imply I haven't got enough memory.
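If it helps diagnose things, that limit looks like a circuit breaker tripping while loading fielddata for the Body field. The per-breaker limits and estimated usage should be visible via the node stats API, and per-field fielddata usage via _cat/fielddata (both should exist on recent 1.x releases, if I'm reading the docs right):

curl -XGET 'http://localhost:9200/_nodes/stats/breaker?pretty'
curl -XGET 'http://localhost:9200/_cat/fielddata?v'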
The Elasticsearch servers in question are actually VMs, so I shut down the other VMs and gave each Elasticsearch instance 96GB, and each OS another 96GB.
The same problem occurred (with different numbers, and it took longer). I haven't got hardware to hand with more than 192GB of memory available, so I can't go higher.
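I'm aware the fielddata breaker limit can be raised dynamically (a sketch, assuming the 1.4+ setting name; I believe it was indices.fielddata.breaker.limit in earlier releases), but that seems like it would only postpone the failure rather than fix the underlying memory use:

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "indices.breaker.fielddata.limit": "85%"
  }
}'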
Are aggregations not meant for use against the index as a whole? Or am I making a mistake with regard to the query format?