How to provide the most relevant results with Multiple Factor Weighted Sorting

I need to provide a weighted sort on 2+ factors, ordered by "relevancy". However, the factors aren't completely isolated, in that I want one or more of the factors to affect the "urgency" (weight) of the others.

Example: contributed content (articles) can be up-/down-voted, and thus have a rating; they have a post date, and they're also tagged with categories. Users write the articles and can vote, and may or may not have some kind of ranking themselves (expert, etc). Probably similar to StackOverflow, right?

I want to provide each user with a list of articles grouped by tag but sorted by "relevancy", where relevancy is calculated based on the rating and age of the article, and possibly affected by the ranking of the author. I.e., a highly ranked article written several years ago may not be as relevant as a medium-ranked article written yesterday. And maybe if an article was written by an expert it would be treated as more relevant than one written by "Joe Schmoe".

Another good example would be assigning hotels a "meta score" composed of price, rating, and attractions.

My question is: what is the best algorithm for multiple factor sorting? This may be a duplicate of that question, but I'm interested in a generic algorithm for any number of factors (more realistically, 2-4 factors), preferably a "fully automatic" function that doesn't require tweaking or user input, and I can't parse linear algebra and eigenvector wackiness.


Possibilities I've found so far:

Note: S is the "sorting score"

  1. "Linearly weighted" - use a function like: S = (w1 * F1) + (w2 * F2) + (w3 * F3), where wx are arbitrarily assigned weights, and Fx are the values of the factors. You'd also want to normalize F (i.e. Fx_n = Fx / Fmax). I think this is kinda how Lucene search works.
  2. "Base-N weighted" - more like grouping than weighting, it's just a linear weighting where weights are increasing multiples of base-10 (a similar principle to CSS selector specificity), so that more important factors are significantly higher: S = 1000 * F1 + 100 * F2 + 10 * F3 ....
  3. Estimated True Value (ETV) - this is apparently what Google Analytics introduced in their reporting, where the value of one factor influences (weights) another factor - the consequence being to sort on more "statistically significant" values. The link explains it pretty well, so here's just the equation: S = (F2 / F2_max * F1) + ((1 - (F2 / F2_max)) * F1_avg), where F1 is the "more important" factor ("bounce rate" in the article), and F2 is the "significance modifying" factor ("visits" in the article).
  4. Bayesian Estimate - looks really similar to ETV, this is how IMDb calculates their rating. See this StackOverflow post for explanation; equation: S = (F2 / (F2 + F2_lim)) * F1 + (F2_lim / (F2 + F2_lim)) * F1_avg, where Fx are the same as #3, and F2_lim is the minimum threshold limit for the "significance" factor (i.e. any value less than X shouldn't be considered). A rough sketch of these scoring functions follows the list.
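
For concreteness, here's a minimal Python sketch of the four options under the definitions above; the function and parameter names are illustrative, not from any library:

    def linear_weighted(factors, maxima, weights):
        """Option #1: S = sum(w * F / F_max) with hand-picked weights."""
        return sum(w * f / m for f, m, w in zip(factors, maxima, weights))

    def base_n_weighted(factors, base=10):
        """Option #2: descending powers of the base, most important factor first."""
        n = len(factors)
        return sum(base ** (n - i) * f for i, f in enumerate(factors))

    def etv(f1, f2, f2_max, f1_avg):
        """Option #3: a small F2 pulls F1 toward the global average F1_avg."""
        w = f2 / f2_max
        return w * f1 + (1 - w) * f1_avg

    def bayesian_estimate(f1, f2, f2_lim, f1_avg):
        """Option #4: IMDb-style rating with minimum-significance threshold F2_lim."""
        return (f2 / (f2 + f2_lim)) * f1 + (f2_lim / (f2 + f2_lim)) * f1_avg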

Options #3 or #4 look really promising, since you don't really have to choose an arbitrary weighting scheme like you do in #1 and #2, but the problem is how do you do this for more than two factors?

I also came across the SQL implementation for a two-factor weighting algorithm, which is basically what I'll need to write eventually.

Stoker answered 6/1, 2012 at 15:57 Comment(6)
Just for clarity, which factor would you have change the weights of which other factors in your example? Is one of them much more important than the others, or do you just want to avoid manually establishing weights? – Watanabe
@Watanabe I honestly don't remember (2+ years ago); I probably just wanted to avoid manually establishing weights, since any time we changed our mind regarding importance we'd have to deploy code, not to mention picking the correct weights in the first place. – Stoker
Sorry, I realized it was a 2-year-old post after the comment. I was going to suggest you use what's called a 'compromise solution' in optimization lingo. Basically, you choose the absolute ideal 'point' in your solution space (highest rank poster, newest date, etc.) and then the inverse of the Euclidean distance from that point would be your score, i.e. S = 1/sqrt((rank - rank_ideal)^2 + (age - age_ideal)^2 + ... + (xn - xn_ideal)^2). Anyway, hope you got it figured out. – Watanabe
@Watanabe no worries; you should post that suggestion as an answer so it'll be found more easily. – Stoker
For the "Linearly weighted" algorithm, do the weights have to add up to 1? What happens if I have something like S = (f1 * .80) + (f2 * .80)? – Noakes
@Noakes the internet explodes and you get something like SO's April Fools redesign (Comic Sans comments!!!)...but more likely it would just arbitrarily inflate your final value, which may not matter if it's just for sorting. – Stoker

As mentioned in the comments, I would suggest what's called the 'compromise solution' to anyone with a similar problem who is more concerned with not having to set weights than with making one criterion more heavily weighted than the others.

Basically, you consider each of your criteria as a coordinate (after normalization, of course). Based on your judgement, you choose the absolute optimal point, e.g. in this case, the highest-ranked author, the newest article, etc. Once you choose the optimal solution, every other 'solution' is rated based on its distance from that optimal. A sample formula would be the inverse of the Euclidean distance for each article's score: S = 1/sqrt((rank - rank_ideal)^2 + (age - age_ideal)^2 + ... + (xn - xn_ideal)^2).

This treats all criteria as equal, so keep that in mind.
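
A minimal sketch of this scoring, assuming each criterion is normalized by its maximum possible value and with a small epsilon to guard the division-by-zero case raised in the comments below:

    import math

    def compromise_score(point, ideal, maxima, eps=1e-9):
        # Normalize each criterion by its maximum, then score by the
        # inverse Euclidean distance to the (normalized) ideal point.
        dist = math.sqrt(sum(
            (p / m - i / m) ** 2 for p, i, m in zip(point, ideal, maxima)
        ))
        return 1.0 / (dist + eps)  # eps avoids division by zero at the ideal itself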

Watanabe answered 30/12, 2014 at 17:44 Comment(4)
Won't this be a division by zero if it hits an exact match? – After
Yes, in the event you have a non-unique set, division by zero is possible. This is trivial to handle in code (calculate the divisor first, check for "smallness," error/throw out if necessary). That said, in this use case, non-uniqueness a) wasn't mentioned as a constraint and b) seems unlikely, given the type of dataset and the number of dimensions. – Watanabe
Sorry for bothering you, sir, but I have another question! What if the ranges of the criteria differ widely, e.g. criterion #1 ranges from 1-30 and criterion #2 ranges over 1000+? The score would be heavily pulled by criterion #2, right? How can I normalize this? – After
Divide each criterion/measurement by the maximum possible for that criterion. This will normalize each criterion to the range 0-1. – Watanabe

The solution pointed out briefly by @gankoji is a simplification of the TOPSIS method.

In TOPSIS the compromise solution can be regarded as choosing the solution with the shortest Euclidean distance from the ideal solution and the farthest Euclidean distance from the negative ideal solution.

This class of problems falls under the term MCDM - Multiple Criteria Decision Making.

Python packages scikit-criteria and mcdm provide implementations of most popular methods. The package docs link to the respective algorithm papers.
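
For illustration, here's a hand-rolled TOPSIS sketch in Python/numpy (this is not the scikit-criteria or mcdm API; see their docs for the real interfaces):

    import numpy as np

    def topsis(matrix, weights, benefit):
        # matrix: (alternatives x criteria); benefit[j] is True if higher is better
        norm = matrix / np.sqrt((matrix ** 2).sum(axis=0))  # vector normalization
        v = norm * weights
        ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))  # best per criterion
        anti = np.where(benefit, v.min(axis=0), v.max(axis=0))   # worst per criterion
        d_pos = np.sqrt(((v - ideal) ** 2).sum(axis=1))  # distance to ideal
        d_neg = np.sqrt(((v - anti) ** 2).sum(axis=1))   # distance to negative ideal
        return d_neg / (d_pos + d_neg)  # closeness coefficient; sort descending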

Diamond answered 1/9, 2020 at 13:33 Comment(0)

Consider chaining of the weights. E.g. you have 3 factors: X, Y and Z. You can calculate ETVyz as W = (Z/Zmax * Y) + (1 - Z/Zmax) * Yavg for each record and then calculate ETVxw as S = (W/Wmax * X) + (1 - W/Wmax) * Xavg. You can chain more factors similarly.
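
A minimal sketch of this chaining, assuming X, Y, Z are equal-length lists of numbers (the names are illustrative):

    def etv(primary, significance):
        # One ETV pass: weight each primary value by its significance factor.
        sig_max = max(significance)
        avg = sum(primary) / len(primary)
        return [(s / sig_max) * p + (1 - s / sig_max) * avg
                for p, s in zip(primary, significance)]

    W = etv(Y, Z)  # fold Z into Y
    S = etv(X, W)  # then fold the intermediate W into X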

Darrondarrow answered 20/3, 2012 at 18:42 Comment(1)
But you can't normalize W (the W vs Wmax) in the function for ETVxw, because it's already the result of internally normalized factors. – Stoker
