You need to distinguish between kNN (k Nearest Neighbors) and exact search.
With exact search (i.e. brute-force search using a `script_score` query), if you have 1M vectors, your query vector is compared against each of them, and the results you get are the true 10 closest vectors to your query vector.
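For reference, such a brute-force search can be sketched like this (assuming a hypothetical index `my-index` with a `dense_vector` field called `my_vector`; the `+ 1.0` just keeps scores positive):

```
POST my-index/_search
{
  "size": 10,
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_vector') + 1.0",
        "params": { "query_vector": [0.12, 0.45, 0.91] }
      }
    }
  }
}
```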
With kNN search, also called approximate nearest neighbors (ANN) search, it's a bit different, because your 1M vectors are indexed in a dedicated structure that depends on your vector search engine (Inverted File Index, KD-trees, Hierarchical Navigable Small Worlds, etc.). In Elasticsearch, which is based on Apache Lucene, vectors are indexed in a Hierarchical Navigable Small Worlds (HNSW) graph. At search time, the HNSW algorithm tries to figure out the k nearest neighbors to your query vector based on closest distance, or highest similarity. It might find the true ones, or not, hence the approximate nature of these search algorithms. To decrease the odds of "or not", the idea is to visit a larger number of candidate vectors, and that's the role of `num_candidates`.
The idea is NOT to pick a value of `num_candidates` high enough to visit all vectors in your index, as that would boil down to an exact search and it would make no sense to use an ANN algorithm for that; just run an exact search, pay the execution price and that's it.
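Concretely, an ANN search with an explicit `num_candidates` looks like this (again assuming the hypothetical `my-index` / `my_vector` names; `num_candidates: 100` means each shard considers 100 candidates before returning its top 10):

```
POST my-index/_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.12, 0.45, 0.91],
    "k": 10,
    "num_candidates": 100
  }
}
```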
The shard sizing document you are referring to does not pertain to kNN search, which has its own, different tuning strategy. Since the HNSW graph is built per segment and each segment needs to be searched, the ideal situation is a single segment to search, i.e. one shard with one force-merged segment. Depending on your data volume, and if you're constantly indexing new vectors, that might not be feasible, but you should optimize in that direction, i.e. as few shards with as few segments as possible.
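Merging down to a single segment (on an index that is no longer being written to) would look like this, with `my-index` as a placeholder name:

```
POST my-index/_forcemerge?max_num_segments=1
```

Note that force-merging is only recommended on read-only indices; on an index that is still receiving writes, new segments will keep appearing anyway.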
Let's say you manage to get your 1M vectors into a single shard with a single segment; there's no reason to use a high `num_candidates`, because the HNSW search algorithm has a pretty good recall rate and doesn't need to visit more than a certain number of candidates (to be figured out depending on your use case, constraints, data volume, SLA, etc.) in order to find the top k ones.
Update March 27th, 2024:
It is worth noting that as of ES 8.13, `k` and `num_candidates` have become optional and their respective values are set to sensible defaults:
- `k` defaults to the value of `size`
- `num_candidates` defaults to the minimum of `10000` and `1.5 * k`
So by default:
size = k = 10
num_candidates = 15
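That defaulting logic can be sketched as a small function (a hypothetical helper illustrating the rules above, not the actual Elasticsearch code):

```python
def knn_defaults(size=10, k=None, num_candidates=None):
    """Sketch of the ES 8.13 defaulting rules for k and num_candidates."""
    if k is None:
        k = size  # k defaults to size
    if num_candidates is None:
        # num_candidates defaults to min(10000, 1.5 * k)
        num_candidates = min(10000, int(1.5 * k))
    return k, num_candidates

print(knn_defaults())  # (10, 15)
```

With all defaults in place, `size = k = 10` gives `num_candidates = 15`, matching the figures above; with a very large `k`, the 10,000 cap kicks in.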