Client-side predictive search relevance calculation with multiple parameters

I'm writing a predictive search that, for server performance reasons (everything is cached), must run in the client browser. The items are TV shows and movies, and they are matched by title and by actor and director names. Once the search has been executed, it returns a list of matched items with two values for each result:

  1. The number of matching words (n): The user may type 4 words but only 2 of them match an item. The more the better.

  2. The Levenshtein edit distance sum (ld): The user can type 3 words, but 2 of them have a typo or some other small difference from the indexed ones. I use the edit distance to find the nearest indexed word, and the sum of all the Levenshtein distances is returned as a proximity indicator (a rough sketch of this computation follows this list). The lower the better.
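
A rough sketch of how these two values could be computed on the client, using the standard dynamic-programming Levenshtein distance (the helper names here are just for illustration, not my actual implementation):

// Standard dynamic-programming Levenshtein edit distance between two words.
function levenshtein(a, b) {
    const dist = Array.from({ length: a.length + 1 }, (_, i) =>
        Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
    );
    for (let i = 1; i <= a.length; i++) {
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            dist[i][j] = Math.min(
                dist[i - 1][j] + 1,        // deletion
                dist[i][j - 1] + 1,        // insertion
                dist[i - 1][j - 1] + cost  // substitution
            );
        }
    }
    return dist[a.length][b.length];
}

// The distance from a query word to its nearest indexed word is what gets
// added into ld for that word.
function nearestDistance(queryWord, indexedWords) {
    return Math.min(...indexedWords.map(w => levenshtein(queryWord, w)));
}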

Requirements

  1. Client-side. No Sphinx, Lucene or any other server-side solutions.

  2. Speed over accuracy. The algorithm runs on each keystroke, so we don't want to bore the user. Keep the big O not so big.

  3. Non-recursive. The calculation of each item's relevance shouldn't depend on the other items' calculations. I don't want to beat Google, only to show the best result of a small set first.

  4. Bounded from 0 to 1, 0 to 100 or something like that. It is not a requirement, but being able to show a "relevance percent" is a plus.

  5. Ideas over implementations. I'm looking for an algorithm/formula rather than a specific implementation.

My approach

Based on exponential decay (like radioactive half-life decay), I made up this formula.

relevance = (n / T) * e^(-ld / n)

Where:

  • T is the number of words the user provided.
  • n is the number of matching words.
  • ld is the sum of the Levenshtein distances for these matching words.

In pseudocode.

function lambda(n, ld, T) {
    // n: matched words, ld: summed Levenshtein distance, T: words the user typed
    return (n / T) * Math.exp(-ld / n);
}

A little explanation:

  • -ld / n is the core of the relevance measure. If ld is low and n is big, it gets close to zero (from the negative side) and indicates that this result is more relevant.

  • n/T is an accuracy ratio: matched words vs. all words. It refines the previous relevance by taking the total user input into account.

The exponential function, for negative exponents, bounds the result between 0 and 1.
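
For example, two illustrative calls to the function above (made-up inputs, results rounded):

lambda(2, 1, 3);   // 2 of 3 typed words matched, one typo in total: (2/3) * e^(-1/2) ≈ 0.40
lambda(3, 0, 3);   // all 3 words matched exactly: (3/3) * e^0 = 1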

And, finally, the question

What I want is not to refine the search algorithm itself (it is based on this response, plus an extra edit distance calculation), but to improve the relevance sort of the returned elements by assigning a relevance value to each one. If any parameter other than n and ld is needed and easily computed, it can be used. In my solution I added T, the number of words the user provided.

Hershberger answered 28/1, 2011 at 13:7 Comment(4)
Is ld between 0 and 1? – Stallworth
@Stallworth No, it isn't. It is an integer value between 0 and T*MAX_LD, where MAX_LD is a preset threshold. The LD between 'color' and 'colour' is 1; between 'house' and 'noses' it is 3. – Hershberger
I understand: you do not use stemming, so 'house' and 'houses' are two distinct words, and the Levenshtein distance is used to overcome this issue? – Stallworth
Right, no stemming because of the client-side restriction. – Hershberger

In general, I believe you should simplify your formula. In fact, most basic relevance formulas like tf-idf are quite simple and use just products or ratios of parameters, possibly with "strengthening" or "weakening" functions. For example, tf-idf is just the multiplication of term frequency by inverse document frequency, the latter "weakened" by a logarithm. First I'll make a quick analysis of your formula and then offer some proposals.

Analysis

Let's rewrite your formula:

relevance = (n / T) * 1 / e^(ld / n)

First of all, note that n/T is not normalized: there may be more matches (n) than search terms (T). Consider this example: the user enters the query "John Malkovich", and there's the film Being John Malkovich in your database. The user entered 2 words, i.e. T = 2, but the film contains these terms in both the title and the cast, so n = 2 * 2 = 4. Given that, the final relevance will be greater than 1. The lack of normalization is not a problem by itself, but in practice it may lead to a number of errors in the future.
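
One possible way to keep that ratio bounded is to count each query term at most once per item, no matter how many fields it appears in; a minimal sketch, assuming each item carries title and cast fields (both names are illustrative):

// Count each distinct query term at most once, even if it appears in both
// the title and the cast, so n never exceeds T and n/T stays within [0, 1].
function countMatchedTerms(queryTerms, item) {
    const haystack = (item.title + ' ' + item.cast).toLowerCase();
    return queryTerms.filter(term => haystack.includes(term.toLowerCase())).length;
}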

Now let's look at the second part of your formula, 1 / e^(ld/n). Let's denote ld/n as x. In this case the second part of your formula will look like this:

f(x) = 1 / e^x

So for high values of x it weakens the final relevance a lot. Though I don't understand why it must be exponential, it still makes sense. But x is not an independent variable; it is itself a function of 2 variables: x = x(ld, n). Moreover, ld is also a function: ld = ld(MAX_LD, T), so x depends on 3 different independent variables/parameters: x = x(MAX_LD, T, n). In this case it is very hard to predict the behavior of x (and so of the final relevance) for all possible cases.

Proposals

1. Simplify x(). If you want the second part of your formula to track only the Levenshtein distance, let it depend on that distance alone, not on all 3 independent variables. For example, your formula may look like this:

relevance = (n / T) * 1 / e^distance

or even:

relevance = (n / T) * 1 / (1 + distance)

where distance is the actual Levenshtein edit distance, not a function of T and MAX_LD.
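
A minimal sketch of the second variant in JavaScript (parameter names are just illustrative):

function simpleRelevance(n, T, distance) {
    // n: matched words, T: words the user typed, distance: summed Levenshtein distance
    return (n / T) * (1 / (1 + distance));
}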

2. Use stemming. I know, I know, you said you cannot use server-side programming. But are you sure it cannot be performed on the client side? Stemming is much easier than it seems. Most of it is just truncating suffixes and endings like "-s", "-ing", "-ment", etc. It is not ideal, but I believe it will give much better results than the Levenshtein distance. The only strong restriction here: stemming must be used for both indexing and searching.

For a more precise stemming algorithm, see the PorterStemmer class in the Lucene sources.
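
A naive client-side suffix stripper in that spirit (not the Porter algorithm, just a sketch; the suffix list is illustrative and must be applied both when indexing and when searching):

// Strip a few common English suffixes; apply the same function to the
// indexed words and to the query words so both sides agree.
function naiveStem(word) {
    const suffixes = ['ments', 'ment', 'ness', 'ings', 'ing', 's'];
    for (const suffix of suffixes) {
        if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
            return word.slice(0, -suffix.length);
        }
    }
    return word;
}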

3. Use inverse record frequency. Recall the example with the query "John Malkovich". There may be plenty of films with the term "John", but only a few with "Malkovich". It is natural to assume that the second term should have greater weight in the search results than the first one. tf-idf captures this fact in its idf (inverse document frequency) part. You can do the same thing by computing an inverse record frequency as:

irf = number-of-all-found-records / number-of-records-with-current-term

and adding a third part to your relevance formula:

relevance = (n / T) * 1 / (1 + distance) * irf
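
A sketch of how that irf part might be folded into the score (one possible combination; the per-term bookkeeping is up to the matcher, and the names here are made up):

// irf = number-of-all-found-records / number-of-records-with-current-term,
// multiplied into the score alongside the other two parts.
function relevanceWithIrf(n, T, distance, allFoundRecords, recordsWithTerm) {
    const irf = allFoundRecords / recordsWithTerm;
    return (n / T) * (1 / (1 + distance)) * irf;
}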

And, of course, remember: no formula is good until it is tested on real data.

Stallworth answered 30/1, 2011 at 18:27 Comment(3)
About the analysis: n/T is normalized because the matching index is unique; it does not count duplicated words within the title (The Lord of the Rings: The Return of the King) nor words appearing in both the title and the cast (Being John Malkovich). The exponential function was, at first, the easy way to bound the output between 0 and 1 (I forgot about the 1/(1+x) option). After that, I realized that exponential decay makes sense, as you said, with the concept of relevance. – Hershberger
About the proposals: from my point of view, I prefer no stemming to bad stemming. I will try some x() simplifications and some of the tf-idf ideas on the real corpus, but it is really small and all its words are infrequent. Maybe it is useful to down-weight the usual suspects ('the', 'and', 'of'...). I'll keep you updated. – Hershberger
After some tests on real data, almost all modifications show similar results (small corpus size). In the end I decided to add tf-idf as a third part and to replace the 1/e^x with 1/(1+x) to get a smoother decay. On the other hand, I kept ld/n because of the relevance it adds. Thank you very much. – Hershberger
