I'm writting a predictive search that, for server performance requisites (all is cached), must be run on the client browser. The items are TV shows and movies and are matched by title, actors and directors names. Once the search has been executed, it returns a list of matched items with two values for each result:
The number of matching words (n): The user can type 4 words but only 2 of them matched an item. The more the better.
The Levenshtein Edit Distance addition (ld). The user can type 3 words but 2 of them have a typo or other small diference with the indexed ones. I use the edit distance to find nearest indexed word. The adition of all the Levenshtein Distances is returned as a proximity indicator. The less the better.
Requirements
Client-side. No Sphinx, Lucene or any other server-side solutions.
Speed over accuracy. The algorithm runs on each keystroke, and we don't want to bore the user. Keep the big O not so big.
Non-recursive. The calculation of each item relevance shouldn't be dependant to the other itmes calculation. I don't want to beat Google, only provide first the best result of a small set.
Bounded form 0 to 1, 0 to 100 or something like this. Is not a requisite, but be able to show a "relevance percent" is a plus.
Ideas over Implementations. I'm looking for an algorithm/formula better that for a specific implementation.
My aproach
Based on the exponential decay (like the radioactive half-life decomposition) I made up this formula.
Where:
T
is the number of words the user provided.n
is the number of matching words.ld
is the Levenshtein distance addition for this matching words.
In pseudocode.
function lambda(n, ld) {
lambda = (n/T) * e^(-ld * 1/n);
return lambda;
}
A little explanation:
-ld * 1/n
is the relevancy measure core. Ifld
is low andn
is big, it comes close to zero (-0 side) and indicates thet this result is more relevant.n/T
is an accuracy ratio. Matched words vs. All words. Refines the previous relevance by taking the total user input into account.
The exponential function, for negatives powers, bound the result between 0 and 1.
And, finally, the question
What I want is not to refine the search algorithm, based on this response with an extra edit distance calculation, but to improve the relevancy sort of returned elements by asign a relevance value to each one. If any parameter other than n
and ld
is needed and easily computed, can be used. In my solution I added T
, the number of words the user provided.
ld
between 0 and 1? – Stallworth