Client-side predictive search relevance calculation with multiple parameters

I'm writing a predictive search that, for server performance reasons (everything is cached), must run in the client browser. The items are TV shows and movies, and they are matched by title and by actor and director names. Once the search has been executed, it returns a list of matched items with two values for each result:

  1. The number of matching words (n): The user may type 4 words but only 2 of them match an item. The more the better.

  2. The Levenshtein edit distance sum (ld): The user can type 3 words, but 2 of them have a typo or some other small difference from the indexed ones. I use the edit distance to find the nearest indexed word, and the sum of all the Levenshtein distances is returned as a proximity indicator (a rough sketch of this computation follows this list). The lower the better.
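
A rough sketch of how these two values could be computed on the client, using the standard dynamic-programming Levenshtein distance (the helper names here are just for illustration, not my actual implementation):

// Standard dynamic-programming Levenshtein edit distance between two words.
function levenshtein(a, b) {
    const dist = Array.from({ length: a.length + 1 }, (_, i) =>
        Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
    );
    for (let i = 1; i <= a.length; i++) {
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            dist[i][j] = Math.min(
                dist[i - 1][j] + 1,        // deletion
                dist[i][j - 1] + 1,        // insertion
                dist[i - 1][j - 1] + cost  // substitution
            );
        }
    }
    return dist[a.length][b.length];
}

// The distance from a query word to its nearest indexed word is what gets
// added into ld for that word.
function nearestDistance(queryWord, indexedWords) {
    return Math.min(...indexedWords.map(w => levenshtein(queryWord, w)));
}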

Requirements

  1. Client-side. No Sphinx, Lucene or any other server-side solutions.

  2. Speed over accuracy. The algorithm runs on each keystroke, so we don't want to bore the user. Keep the big O not so big.

  3. Non-recursive. The calculation of each item's relevance shouldn't depend on the other items' calculations. I don't want to beat Google, only to show the best result of a small set first.

  4. Bounded from 0 to 1, 0 to 100 or something like that. It is not a requirement, but being able to show a "relevance percent" is a plus.

  5. Ideas over implementations. I'm looking for an algorithm/formula rather than a specific implementation.

My approach

Based on exponential decay (like radioactive half-life decay), I made up this formula.

relevance = (n / T) * e^(-ld / n)

Where:

  • T is the number of words the user provided.
  • n is the number of matching words.
  • ld is the sum of the Levenshtein distances for these matching words.

In pseudocode.

function lambda(n, ld, T) {
    // n: matched words, ld: summed Levenshtein distance, T: words the user typed
    return (n / T) * Math.exp(-ld / n);
}

A little explanation:

  • -ld / n is the core of the relevance measure. If ld is low and n is big, it gets close to zero (from the negative side) and indicates that this result is more relevant.

  • n/T is an accuracy ratio: matched words vs. all words. It refines the previous relevance by taking the total user input into account.

The exponential function, for negative exponents, bounds the result between 0 and 1.
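
For example, two illustrative calls to the function above (made-up inputs, results rounded):

lambda(2, 1, 3);   // 2 of 3 typed words matched, one typo in total: (2/3) * e^(-1/2) ≈ 0.40
lambda(3, 0, 3);   // all 3 words matched exactly: (3/3) * e^0 = 1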

And, finally, the question

What I want is not to refine the search algorithm itself (it is based on this response, plus an extra edit distance calculation), but to improve the relevance sort of the returned elements by assigning a relevance value to each one. If any parameter other than n and ld is needed and easily computed, it can be used. In my solution I added T, the number of words the user provided.

Hershberger answered 28/1, 2011 at 13:7 Comment(4)
Is ld between 0 and 1? – Stallworth
@Stallworth No, it isn't. It is an integer value between 0 and T*MAX_LD, where MAX_LD is a preset threshold. The LD between 'color' and 'colour' is 1; between 'house' and 'noses' it is 3. – Hershberger
I understand: you do not use stemming, so 'house' and 'houses' are two distinct words, and the Levenshtein distance is used to overcome this issue? – Stallworth
Right, no stemming because of the client-side restriction. – Hershberger

In general, I believe you should simplify your formula. In fact, most basic relevance formulas like tf-idf are quite simple and use just products or ratios of parameters, possibly with "strengthening" or "weakening" functions. For example, tf-idf is just the multiplication of term frequency by inverse document frequency, the latter "weakened" by a logarithm. First I'll make a quick analysis of your formula and then offer some proposals.

Analysis

Let's rewrite your formula:

relevance = (n / T) * 1 / e^(ld / n)

First of all, note that n/T is not normalized: there may be more matches (n) than search terms (T). Consider this example: the user enters the query "John Malkovich", and there's the film Being John Malkovich in your database. The user entered 2 words, i.e. T = 2, but the film contains these terms in both the title and the cast, so n = 2 * 2 = 4. Given that, the final relevance will be greater than 1. The lack of normalization is not a problem by itself, but in practice it may lead to a number of errors in the future.
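
One possible way to keep that ratio bounded is to count each query term at most once per item, no matter how many fields it appears in; a minimal sketch, assuming each item carries title and cast fields (both names are illustrative):

// Count each distinct query term at most once, even if it appears in both
// the title and the cast, so n never exceeds T and n/T stays within [0, 1].
function countMatchedTerms(queryTerms, item) {
    const haystack = (item.title + ' ' + item.cast).toLowerCase();
    return queryTerms.filter(term => haystack.includes(term.toLowerCase())).length;
}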

Now let's look at the second part of your formula, 1 / e^(ld/n). Let's denote ld/n as x. In this case the second part of your formula will look like this:

f(x) = 1 / e^x

So for high values of x it weakens the final relevance a lot. Though I don't understand why it must be exponential, it still makes sense. But x is not an independent variable; it is itself a function of 2 variables: x = x(ld, n). Moreover, ld is also a function: ld = ld(MAX_LD, T), so x depends on 3 different independent variables/parameters: x = x(MAX_LD, T, n). In this case it is very hard to predict the behavior of x (and so of the final relevance) for all possible cases.

Proposals

1. Simplify x(). If you want the second part of your formula to track only the Levenshtein distance, let it depend on that distance alone, not on all 3 independent variables. For example, your formula may look like this:

relevance = (n / T) * 1 / e^distance

or even:

relevance = (n / T) * 1 / (1 + distance)

where distance is the actual Levenshtein edit distance, not a function of T and MAX_LD.
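
A minimal sketch of the second variant in JavaScript (parameter names are just illustrative):

function simpleRelevance(n, T, distance) {
    // n: matched words, T: words the user typed, distance: summed Levenshtein distance
    return (n / T) * (1 / (1 + distance));
}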

2. Use stemming. I know, I know, you said you cannot use server-side programming. But are you sure it cannot be performed on the client side? Stemming is much easier than it seems. Most of it is just truncating suffixes and endings like "-s", "-ing", "-ment", etc. It is not ideal, but I believe it will give much better results than the Levenshtein distance. The only strong restriction here: stemming must be used for both indexing and searching.

For a more precise stemming algorithm, see the PorterStemmer class in the Lucene sources.
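
A naive client-side suffix stripper in that spirit (not the Porter algorithm, just a sketch; the suffix list is illustrative and must be applied both when indexing and when searching):

// Strip a few common English suffixes; apply the same function to the
// indexed words and to the query words so both sides agree.
function naiveStem(word) {
    const suffixes = ['ments', 'ment', 'ness', 'ings', 'ing', 's'];
    for (const suffix of suffixes) {
        if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
            return word.slice(0, -suffix.length);
        }
    }
    return word;
}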

3. Use inverse record frequency. Recall the example with the query "John Malkovich". There may be plenty of films with the term "John", but only a few with "Malkovich". It is natural to assume that the second term should have greater weight in the search results than the first one. tf-idf captures this fact in its idf (inverse document frequency) part. You can do the same thing by computing an inverse record frequency as:

irf = number-of-all-found-records / number-of-records-with-current-term

and adding a third part to your relevance formula:

relevance = (n / T) * 1 / (1 + distance) * irf
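
A sketch of how that irf part might be folded into the score (one possible combination; the per-term bookkeeping is up to the matcher, and the names here are made up):

// irf = number-of-all-found-records / number-of-records-with-current-term,
// multiplied into the score alongside the other two parts.
function relevanceWithIrf(n, T, distance, allFoundRecords, recordsWithTerm) {
    const irf = allFoundRecords / recordsWithTerm;
    return (n / T) * (1 / (1 + distance)) * irf;
}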

And, of course, remember: no formula is good until it is tested on real data.

Stallworth answered 30/1, 2011 at 18:27 Comment(3)
About the analysis: n/T is normalized because the matching index is unique; it does not count duplicated words within the title (The Lord of the Rings: The Return of the King) nor words appearing in both the title and the cast (Being John Malkovich). The exponential function was, at first, the easy way to bound the output between 0 and 1 (I forgot about the 1/(1+x) option). After that, I realized that exponential decay makes sense, as you said, with the concept of relevance. – Hershberger
About the proposals: from my point of view, I prefer no stemming to bad stemming. I will try some x() simplifications and some of the tf-idf ideas on the real corpus, but it is really small and all its words are infrequent. Maybe it is useful to down-weight the usual suspects ('the', 'and', 'of'...). I'll keep you updated. – Hershberger
After some tests on real data, almost all modifications show similar results (small corpus size). In the end I decided to add tf-idf as a third part and to replace the 1/e^x with 1/(1+x) to get a smoother decay. On the other hand, I kept ld/n because of the relevance it adds. Thank you very much. – Hershberger
