How to optimize Elasticsearch percolator index memory performance

Is there a way to improve memory performance when using an elasticsearch percolator index?

I have created a separate index for my percolator. I have roughly 1 000 000 user-created saved searches (for email alerts). After creating this percolator index, my heap usage spiked to 100% and the server became unresponsive to any queries. I have somewhat limited resources and am not able to simply throw more RAM at the problem. The only solution was to delete the index containing my saved searches.

From what I have read, the percolator index resides in memory permanently. Is this entirely necessary? Is there a way to throttle this behaviour but still preserve the functionality? Is there a way to optimize my data/queries/index structure to circumvent this behaviour while still achieving the desired result?
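
For reference, this is roughly how each saved search ends up registered: a minimal sketch using the Python client against the ES 1.x-era percolate API (current at the time of writing), with illustrative index and field names:

```python
# Minimal sketch, assuming the ES 1.x-era percolator API: each saved search
# is indexed as a document under the special ".percolator" type.
# Elasticsearch keeps the parsed form of every registered query on the heap,
# which is where the memory pressure comes from.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# One of ~1 000 000 saved searches: a keyword query plus one filter permutation.
saved_search = {
    "query": {
        "filtered": {
            "query": {"match": {"body": "java developer"}},
            "filter": {
                "bool": {
                    "must": [
                        {"term": {"location": "london"}},
                        {"term": {"industry": "software"}},
                    ]
                }
            },
        }
    }
}

es.index(index="saved-searches", doc_type=".percolator",
         id="user42-search-1", body=saved_search)
```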

Outrelief answered 3/2, 2015 at 7:43
How much memory did you allocate to your ElasticSearch server? – Ectosarc
@DennisGorelik I have 3 nodes, each with 16GB of RAM and an 8GB heap allocation. – Outrelief

There is no resolution to this issue from an ElasticSearch point of view, nor is one likely. I have chatted with the ElasticSearch guys directly, and their answer is: "throw more hardware at it".

I have, however, found a way to solve this problem by mitigating my usage of this feature. When I analyzed my saved search data, I discovered that my searches consist of around 100 000 unique keyword searches along with various filter permutations, creating over 1 000 000 saved searches.

If I look at the filters, they are things like:

  • Location - 300+
  • Industry - 50+
  • etc...

Giving a solution space of:

100 000 * >300 * >50 * ... ~= > 1 500 000 000

However, if I were to decompose the searches and index the keyword searches and filters separately in the percolator index, I end up with far fewer searches:

100 000 + >300 + >50 + ... ~= > 100 350

And those searches themselves are smaller and less complicated than the original searches.
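
Decomposed, the registration might look something like this (same illustrative names and assumed ES 1.x API as the sketch in the question; the "kw:"/"loc:"/"ind:" id prefixes are my own convention for telling component types apart):

```python
# Sketch: register each unique component once in a small percolator index,
# instead of one percolator entry per saved search.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

unique_keyword_searches = ["java developer", "python engineer"]  # ~100 000 in practice
unique_locations = ["london", "manchester"]                      # ~300 in practice
unique_industries = ["software", "finance"]                      # ~50 in practice

for i, keywords in enumerate(unique_keyword_searches):
    es.index(index="search-components", doc_type=".percolator", id="kw:%d" % i,
             body={"query": {"match": {"body": keywords}}})

for loc in unique_locations:
    es.index(index="search-components", doc_type=".percolator", id="loc:%s" % loc,
             body={"query": {"term": {"location": loc}}})

for ind in unique_industries:
    es.index(index="search-components", doc_type=".percolator", id="ind:%s" % ind,
             body={"query": {"term": {"industry": ind}}})
```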

Now I create a second (non-percolator) index listing all 1 000 000 saved searches and including the ids of the search components from the percolator index.
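
That mapping index could be as simple as one plain document per saved search (again a sketch; the field names are mine):

```python
# Sketch: a plain (non-percolator) document per saved search, holding only
# the ids of its percolator components plus whoever should receive the alert.
# Assumes keyword_id and filter_ids are mapped not_analyzed so the ids can
# be matched exactly later.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

es.index(index="saved-searches", doc_type="search", id="user42-search-1",
         body={
             "user": "user42",
             "keyword_id": "kw:0",
             "filter_ids": ["loc:london", "ind:software"],
         })
```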

Then I percolate a document and do a second query, filtering the saved searches against the keyword and filter percolator results. I'm even able to preserve the relevance score, as this is returned purely from the keyword searches.
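
Putting it together, the two-step lookup might look like this (a sketch under the same assumptions; the "all filters matched" check is done client-side here, since expressing it as a single ES 1.x filter is not straightforward):

```python
# Sketch of the two-step lookup. Step 1: percolate the incoming document
# against the small component index. Step 2: fetch candidate saved searches
# by matched keyword, then keep only those whose filters ALL matched.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

doc = {"body": "Java developer wanted in London",
       "location": "london", "industry": "software"}

result = es.percolate(index="search-components", doc_type="doc",
                      body={"doc": doc})
matched = [m["_id"] for m in result["matches"]]
keyword_hits = [i for i in matched if i.startswith("kw:")]
filter_hits = {i for i in matched if not i.startswith("kw:")}

# Candidates: saved searches whose keyword component matched.
candidates = es.search(index="saved-searches", body={
    "query": {"filtered": {"filter": {"terms": {"keyword_id": keyword_hits}}}},
    "size": 1000,  # a scroll would be needed for large candidate sets
})

# A saved search fires only if every one of its filters also matched.
alerts = [hit["_source"] for hit in candidates["hits"]["hits"]
          if set(hit["_source"]["filter_ids"]) <= filter_hits]
```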

This approach will significantly reduce my percolator index memory footprint while serving the same purpose.

I would like to invite feedback on this approach (I haven't tried it yet, but I will keep you posted).

Likewise, if my approach is successful, do you think it is worth a feature request?

Outrelief answered 24/4, 2015 at 8:06
