You can use a method inspired by Bayesian probability. The gist of the approach is to have an initial belief about the true rating of an item, and use users' ratings to update your belief.
This approach requires two parameters:
- What do you think is the true "default" rating of an item, if you have no ratings at all for the item? Call this number R, the "initial belief".
- How much weight do you give to the initial belief, compared to the user ratings? Call this W, where the initial belief is "worth" W user ratings of that value.
With the parameters R and W, computing the new rating is simple: assume you have W ratings of value R along with any user ratings, and compute the average. In other words, the score is (W*R + sum of user ratings) / (W + number of user ratings). For example, if R = 2 and W = 3, the final score for various scenarios is computed below:
- 100 (user) ratings of 4: (3*2 + 100*4) / (3 + 100) = 3.94
- 3 ratings of 5 and 1 rating of 4: (3*2 + 3*5 + 1*4) / (3 + 3 + 1) = 3.57
- 10 ratings of 4: (3*2 + 10*4) / (3 + 10) = 3.54
- 1 rating of 5: (3*2 + 1*5) / (3 + 1) = 2.75
- No user ratings: (3*2 + 0) / (3 + 0) = 2
- 1 rating of 1: (3*2 + 1*1) / (3 + 1) = 1.75
This computation takes into consideration the number of user ratings, and the values of those ratings. As a result, the final score roughly corresponds to how happy one can expect to be about a particular item, given the data.
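If it helps to see that computation as code, here is a minimal Python sketch; the function name smoothed_rating and its parameter names are just illustrative, not from any particular library:

```python
def smoothed_rating(user_ratings, prior_rating=2.0, prior_weight=3.0):
    """Average the user ratings together with prior_weight imaginary
    ratings of value prior_rating (the initial belief R, weighted by W)."""
    total = prior_weight * prior_rating + sum(user_ratings)
    count = prior_weight + len(user_ratings)
    return total / count

# The scenarios above, with R = 2 and W = 3:
print(round(smoothed_rating([4] * 100), 2))     # 3.94
print(round(smoothed_rating([5, 5, 5, 4]), 2))  # 3.57
print(round(smoothed_rating([4] * 10), 2))      # 3.54
print(round(smoothed_rating([5]), 2))           # 2.75
print(round(smoothed_rating([]), 2))            # 2.0
print(round(smoothed_rating([1]), 2))           # 1.75
```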
Choosing R
When you choose R, think about what value you would be comfortable assuming for an item with no ratings. Is the typical no-rating item actually 2.4 out of 5, if you were to instantly have everyone rate it? If so, R = 2.4 would be a reasonable choice.
You should not use the minimum value on the rating scale for this parameter, since an item rated extremely poorly by users should end up "worse" than a default item with no ratings.
If you want to pick R using data rather than just intuition, you can use the following method:
- Consider all items with at least some threshold of user ratings (so you can be confident that the average user rating is reasonably accurate).
- For each item, assume its "true score" is the average user rating.
- Choose R to be the median of those scores.
If you want to be slightly more optimistic or pessimistic about a no-rating item, you can choose R to be a different percentile of the scores, for instance the 60th percentile (optimistic) or the 40th percentile (pessimistic).
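As a sketch of this data-driven method, assuming you can build a dict that maps each item to its list of ratings (the names estimate_prior_rating, ratings_by_item, and min_ratings are hypothetical placeholders):

```python
import statistics

def estimate_prior_rating(ratings_by_item, min_ratings=20, percentile=50):
    """Pick R from historical data: average the ratings of every item that
    has at least min_ratings ratings, then take a percentile of those
    averages (50 = median; use e.g. 60 to be slightly more optimistic)."""
    averages = sorted(
        statistics.mean(ratings)
        for ratings in ratings_by_item.values()
        if len(ratings) >= min_ratings
    )
    # Nearest-rank percentile over the sorted per-item averages.
    index = round((percentile / 100) * (len(averages) - 1))
    return averages[index]
```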
Choosing W
The choice of W should depend on how many ratings a typical item has, and how consistent those ratings are. W can be higher if items naturally obtain many ratings, and W should be higher if you have less confidence in user ratings (e.g., if you have high spammer activity). Note that W does not have to be an integer, and can be less than 1.
Choosing W is a more subjective matter than choosing R. However, here are some guidelines:
- If a typical item obtains C ratings, then W should not exceed C, or else the final score will depend more on R than on the actual user ratings. Instead, W should be a small fraction of C, perhaps between C/20 and C/5, depending on how noisy or "spammy" ratings are (see the sketch after this list).
- If historical ratings are usually consistent (for an individual item), then W should be relatively small. On the other hand, if ratings for an item vary wildly, then W should be relatively large. You can think of this algorithm as "absorbing" W ratings that are abnormally high or low, turning them into more moderate ones.
- In the extreme, setting W = 0 is equivalent to using only the average of the user ratings, while setting W = infinity is equivalent to proclaiming that every item has a true rating of R, regardless of the user ratings. Clearly, neither of these extremes is appropriate.
- Setting W too large can have the effect of favoring an item with many moderately-high ratings over an item with slightly fewer exceptionally-high ratings.
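There is no exact formula for W, but as one possible starting point based on the C/20 to C/5 guideline above (the function estimate_prior_weight and its noisy flag are purely illustrative assumptions, not part of the method itself):

```python
import statistics

def estimate_prior_weight(ratings_by_item, noisy=False):
    """Heuristic for W: take the median number of ratings per item (C) and
    use a fraction of it: closer to C/5 when ratings are noisy or spammy,
    closer to C/20 when they are generally trustworthy."""
    typical_count = statistics.median(
        [len(ratings) for ratings in ratings_by_item.values()]
    )
    return typical_count / (5 if noisy else 20)
```

Whatever value this produces, it is worth sanity-checking by scoring a few familiar items by hand and seeing whether the resulting ordering matches your intuition.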