Simple Popularity Algorithm

Summary

As Ted Jaspers wisely pointed out, the methodology I described in the original proposal back in 2012 is actually a special case of an exponential moving average. The beauty of this approach is that it can be calculated recursively, meaning you only need to store a single popularity value with each object and then you can recursively adjust this value when an event occurs. There's no need to record every event.

This single popularity value represents all past events (within the limits of the data type being used), but older events begin to matter exponentially less as new events are factored in. This algorithm will adapt to different time scales and will respond to varying traffic volumes. Each time an event occurs, the new popularity value can be calculated using the following formula:

(a * t) + ((1 - a) * p)

  • a — coefficient between 0 and 1 (higher values discount older events faster)
  • t — current timestamp
  • p — current popularity value (e.g. stored in a database)

Reasonable values for a will depend on your application. A good starting place is a=2/(N+1), where N is the number of events that should significantly affect the outcome. For example, on a low-traffic website where the event is a page view, you might expect hundreds of page views over a period of a few days. Choosing N=100 (a≈0.02) would be a reasonable choice. For a high-traffic website, you might expect millions of page views over a period of a few days, in which case N=1000000 (a≈0.000002) would be more reasonable. The value for a will likely need to be gradually adjusted over time.
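A minimal sketch of this in Python (the function names and signatures are mine, not from the thread):

```python
def smoothing_factor(n_events):
    """Convert N, the number of events that should significantly
    affect the outcome, into the coefficient a = 2 / (N + 1)."""
    return 2.0 / (n_events + 1)

def update_popularity(p, t, a):
    """Exponential moving average over event timestamps.
    Only the single stored value p is needed; no event log."""
    return a * t + (1 - a) * p
```

With a = 0.5 this reduces to the (p + t) / 2 update from the original proposal.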

To illustrate how simple this popularity algorithm is, here's an example of how it can be implemented in Craft CMS in 2 lines of Twig markup:

{% set popularity = (0.02 * date().timestamp) + (0.98 * entry.popularity) %}
{% do entry.setFieldValue("popularity", popularity) %}

Notice that there's no need to create new database tables or store endless event records in order to calculate popularity.

One caveat to keep in mind is that exponential moving averages have a spin-up interval, so it takes a few recursions before the value can be considered accurate. This means the initial condition is important. For example, if the popularity of a new item is initialized using the current timestamp, the item immediately becomes the most popular item in the entire set before eventually settling down into a more accurate position. This might be desirable if you want to promote new content. Alternatively, you may want content to work its way up from the bottom, in which case you could initialize it with the timestamp of when the application was first launched. You could also find a happy medium by initializing the value with an average of all popularity values in the database, so it starts out right in the middle.
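The three initial conditions can be sketched as a hypothetical helper (the strategy names are mine):

```python
import time

def init_popularity(strategy, launch_ts, existing_values):
    """Hypothetical helper covering the three initial conditions
    discussed above."""
    if strategy == "new":   # item starts as the most popular
        return time.time()
    if strategy == "old":   # item works its way up from the bottom
        return launch_ts
    if strategy == "avg":   # item starts in the middle of the pack
        return sum(existing_values) / len(existing_values)
    raise ValueError("unknown strategy: " + strategy)
```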


Original Proposal

There are plenty of suggested algorithms for calculating popularity based on an item's age and the number of votes, clicks, or purchases an item receives. However, the more robust methods I've seen often require overly complex calculations and multiple stored values which clutter the database. I've been contemplating an extremely simple algorithm that doesn't require storing any variables (other than the popularity value itself) and requires only one simple calculation. It's ridiculously simple:

p = (p + t) / 2

Here, p is the popularity value stored in the database and t is the current timestamp. When an item is first created, p must be initialized. There are two possible initialization methods:

  1. Initialize p with the current timestamp t
  2. Initialize p with the average of all p values in the database

Note that initialization method (1) gives recently added items a clear advantage over historical items, thus adding an element of relevance. Initialization method (2), on the other hand, treats new items as equal to historical items.

Let's say you use initialization method (1) and initialize p with the current timestamp. When the item receives its first vote, p becomes the average of the creation time and the vote time. Thus, the popularity value p still represents a valid timestamp (assuming you round to the nearest integer), but the actual time it represents is abstracted.

With this method, only one simple calculation is required and only one value needs to be stored in the database (p). This method also prevents runaway values, since a given item's popularity can never exceed the current time.
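For reference, a minimal Python version of the update (the fiddles below implement the same thing in JavaScript):

```python
def vote(p, t):
    """Original proposal: average the stored popularity value with
    the current timestamp. Since p starts out as a timestamp and
    each t is the current time, p can never exceed the current
    time, so values cannot run away."""
    return (p + t) / 2

# An item created at t = 100 that receives votes at t = 200 and t = 300:
p = 100
p = vote(p, 200)   # 150.0
p = vote(p, 300)   # 225.0
```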

An example of the algorithm at work over a period of 1 day: http://jsfiddle.net/q2UCn/
An example of the algorithm at work over a period of 1 year: http://jsfiddle.net/tWU9y/

If you expect votes to steadily stream in at sub-second intervals, then you will need to use a microsecond timestamp, such as the PHP microtime() function. Otherwise, a standard UNIX timestamp will work, such as the PHP time() function.

Now for my question: do you see any major flaws with this approach?

Climb answered 20/6, 2012 at 20:57 Comment(2)
If you allow people to "unlike" items, storing only p in the database is no longer enough: you also have to store a record of every Like ever made. Otherwise, a user can Like, then Unlike, then Like, then Unlike, over and over again to inflate their vote. As you said, you only want to change the item's p when it receives its first vote, which means you need to keep track of all votes.Kitchenmaid
@AlSweigart good point. This algorithm is probably only appropriate for unidirectional voting systems (e.g. a page view is one "vote" in the positive direction). It's probably less compatible with bi-directional voting systems.Climb

The proposed algorithm is a good approach, and is a special case of an Exponential Moving Average where alpha=0.5:

p = alpha*p + (1-alpha)*t = 0.5*p + 0.5*t = (p+t)/2    //(for alpha = 0.5)

A way to correct the fact that the proposed solution for alpha=0.5 tends to favor recent votes (as noted by daniloquio) is to choose higher values for alpha (e.g. 0.9 or 0.99). Note, however, that applying this to the testcase proposed by daniloquio does not work directly, because when alpha increases the algorithm needs more 'time' to settle (so the arrays should be longer, which is often true in real applications).

Thus:

  • for alpha=0.9 the algorithm averages approximately the last 10 values
  • for alpha=0.99 the algorithm averages approximately the last 100 values
  • for alpha=0.999 the algorithm averages approximately the last 1000 values
  • etc.
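These window sizes can be checked numerically: under p = alpha*p + (1-alpha)*t, the k-th most recent sample carries weight (1-alpha)*alpha^(k-1), so the n most recent samples together carry weight 1 - alpha^n (a quick sketch, not part of the original answer):

```python
def recent_weight(alpha, n):
    """Total weight the EMA p = alpha*p + (1-alpha)*t places on the
    n most recent samples: the geometric series (1-alpha)*alpha**k
    summed for k = 0..n-1, which equals 1 - alpha**n."""
    return 1 - alpha ** n

# For alpha = 0.9, the last 10 samples carry about 65% of the total
# weight, which is why the EMA behaves roughly like an average over
# the last 1/(1-alpha) values.
```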
Baluchistan answered 3/8, 2018 at 10:47 Comment(1)
Awesome answer! One thing though—the article says "a higher α discounts older observations faster", so shouldn't the alpha values work the opposite way you've described?Climb

I think this is a very good approach, given its simplicity. A very interesting result.

I made a quick set of calculations and found that this algorithm does seem to understand what "popularity" means. Its problem is that it has a clear tendency to favor recent votes like this:

Imagine we break time into discrete timestamp values ranging from 100 to 1000, and assume that at t=100 two items A and B both have P = 100.

    A gets voted 7 times at t = 200, 300, 400, 500, 600, 700 and 800,
resulting in a final Pa(800) ≈ 700.

    B gets voted 4 times at t = 300, 500, 700 and 900,
resulting in a final Pb(900) ≈ 712.

When t=1000 comes, both A and B receive votes, so:

Pa(1000) = 850 with 8 votes
Pb(1000) = 856 with 5 votes

Why? Because the algorithm allows an item to quickly beat historical leaders if it receives more recent votes (even if the item has fewer votes in total).
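The numbers above can be reproduced directly with the (p + t) / 2 update (a quick sanity check in Python):

```python
def vote(p, t):
    return (p + t) / 2

# Both items start with P = 100 at t = 100.
pa = pb = 100

for t in (200, 300, 400, 500, 600, 700, 800):  # A's 7 votes
    pa = vote(pa, t)                            # Pa(800) ≈ 700

for t in (300, 500, 700, 900):                  # B's 4 votes
    pb = vote(pb, t)                            # Pb(900) ≈ 712

# At t = 1000 both items receive one more vote:
pa = vote(pa, 1000)                             # ≈ 850, with 8 votes
pb = vote(pb, 1000)                             # ≈ 856, with 5 votes: B wins
```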

EDIT INCLUDING SIMULATION

The OP created a nice fiddle that I changed to get the following results:

http://jsfiddle.net/wBV2c/6/

Item A receives one vote each day from 1970 till 2012 (15339 votes)
Item B receives one vote each month from Jan to Jul 2012 (7 votes)

The result: B is more popular than A.

Gravure answered 20/6, 2012 at 21:28 Comment(9)
Great analysis! You're right that the algorithm does favor more recent activity, which may or may not be desirable depending on the application. In my opinion, this behavior would be appropriate for most applications. Even so, it's a small price to pay for the ease of implementation.Climb
@danielfaraday: Note that if you're using 32bit types, you might overflow, causing updates to drastically lower the rating temporarily.Bascule
@MooingDuck: I don't follow. It's assumed that P will be rounded, so the size will always be equal to the size of the timestamp (regardless of whether the granularity of the timestamp is in seconds, milliseconds, or microseconds).Climb
daniloquio: after further thought, I'm not sure how useful this example is. The difference in total votes is far too small to draw a valid conclusion. I'll repeat the comment I made to btilly: Let's say item A is created in the year 1912 and receives one vote each year until the year 2012 (100 votes total). In order for item B to beat item A at the last second with only 1 vote, it would need to have been created as recently as 2009! That's 97 years later! Here's proof: jsfiddle.net/wBV2c.Climb
@danielfaraday First of all, you'd better not use your algorithm for votes prior to 1970, or the negative timestamps will give you real pains. You're wrong here because it's not a matter of 1 vote beating all: it's all about a new hot item quickly beating historic leaders. Take a look at my edit in the answer.Gravure
@daniloquio: yes, you're right about starting prior to 1970 - not a smart idea. However, you had some mistakes in your code. I fixed the mistakes and the outcome is as you would expect - item A is more popular than item B: jsfiddle.net/wBV2c/8.Climb
In fact, even if item B receives a vote every day from January 2012 to July 2012, it still doesn't beat item A: jsfiddle.net/wBV2c/9.Climb
@danielfaraday in your code you calculate P just by adding t instead of doing (p + t)/2. I modified your code to calculate P according to your algorithm and adding a vote to B from January to December; guess what the result is. jsfiddle.net/wBV2c/10 If my answer wasn't even useful to you (my answer is the only one without +1) then I guess we are done with this discussion. Good day.Gravure
Oops! You're right. Here's the corrected code for the month example: jsfiddle.net/wBV2c/11. Item A wins in this case. Here's the corrected code for the day example: jsfiddle.net/wBV2c/12. Item B wins in this case. This is the behavior I would expect from a popularity algorithm. If one item suddenly stops receiving votes (item A in this example) and another item begins receiving votes at the same rate (item B in this example), I would expect the second (item B) to be the most popular. Your answer is definitely useful! +1Climb

I see one problem: only the last ~24 votes count.

p_(i+1) = (p_i + t_(i+1)) / 2

For two votes we have

p2 = (p1 + t2) / 2 = ((p0 + t1) /2 + t2 ) / 2 = p0/4 + t1/4 + t2/2

Expanding that for 32 votes gives:

p32 = p0*2^-32 + t1*2^-32 + t2*2^-31 + ... + t32*2^-1

So for signed 32-bit values, the oldest vote t1 has no effect on the result: like p0, it gets divided by 2^32, so it contributes nothing to p32.

If we have two items A and B, then no matter how big the difference between them is, if they both get the same 32 votes they will have the same popularity. So your history only goes back 32 votes: there is no difference between 2032 votes and 32 votes if the last 32 votes are the same.

If the timestamps differ by less than a day, the values will already be equal after 17 votes.
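The wash-out effect is easy to demonstrate: no matter how far apart two items start, 32 shared votes divide the gap by 2^32 (a sketch with arbitrary example timestamps):

```python
def vote(p, t):
    return (p + t) / 2

# Two items whose stored values start a billion apart:
pa, pb = 0.0, 1_000_000_000.0

# Both then receive the same 32 votes, one second apart:
for t in range(2_000_000_000, 2_000_000_032):
    pa = vote(pa, t)
    pb = vote(pb, t)

# The initial gap of 10^9 has been divided by 2^32, leaving a
# difference of well under one second of timestamp resolution.
```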

Karlykarlyn answered 21/6, 2012 at 9:8 Comment(5)
This is incorrect. I prove you wrong here: jsfiddle.net/q2UCn. This is an actual calculation of the exact analysis you outline above (item A receiving 217 votes on the same day that item B receives 17 votes). I also ran this analysis on 25 votes over 1 year (jsfiddle.net/tWU9y), which yields a similar result.Climb
Oops, you're right. I didn't correctly interpret the results. Fixed it.Karlykarlyn
Ishtar: that's correct. If two items both receive 32 votes at exactly the same times, then rounding will cause their popularity value to be the same. Here's proof: jsfiddle.net/c4RVr. However, the probability of this happening is extremely small unless votes are steadily streaming in at sub-second intervals. In this case, you can simply use a microsecond timestamp (such as the PHP microtime() function). This solves the problem. Here's proof: jsfiddle.net/k8HXu. This just depends on how much traffic you're expecting.Climb
@danielfaraday - Even for one item, after 32 votes all history is lost. You only keep track of the last 32 votes. Whether that's a problem is up to you. If you use microtime (or, say, 64-bit values), you need more votes to clear the history, but still, voting a few seconds later has more effect than all the votes before the 32nd-last vote.Karlykarlyn
@danielfaraday - You're welcome :). Here is a test jsfiddle.net/cyC7R with a not-popular item and an extremely popular item, both getting 32 random votes during a year. The differences fade out; you can try giving A 'toc' or B 0, it doesn't matter. Only the last few count!Karlykarlyn

The flaw is that something with 100 votes is usually more meaningful than something with only one recent vote. However it isn't hard to come up with variants of your scheme that work reasonably well.

Lorenelorens answered 20/6, 2012 at 21:7 Comment(4)
But the timestamp of the 1 recent vote would not be taken at face value. Instead, it would be averaged with a much older vote. This would likely cause the item to be ranked lower than the item with 100 votes (unless all 100 votes occurred a very long time ago).Climb
Also, if the 100 votes did in fact occur a very long time ago, a time-dependent definition of popularity would require that the item with 100 votes should be ranked lower than the 1 recent vote.Climb
+1 for that last sentence, I agree with you assuming some weird but infrequent results are acceptable.Gravure
Let's say item A is created in the year 1912 and receives one vote each year until the year 2012 (100 votes total). In order for item B to beat item A at the last second with only 1 vote, it would need to have been created as recently as 2009! That's 97 years later! Here's proof: jsfiddle.net/wBV2c. Thus, your comment is invalid. An item with 100 votes will have a higher popularity value, even if an item with 1 vote received that 1 vote very recently. This actually proves how robust this simple little algorithm really is.Climb

I don't think that the above-discussed logic is going to work.

p_(i+1) = (p_i + t) / 2

Article A gets viewed at timestamps 70, 80, 90:
popularity(Article A) = 82.5

Article B gets viewed at timestamps 50, 60, 70, 80, 90:
popularity(Article B) = 80.625

In this case, the popularity of Article B should have been higher: Article B was viewed as recently as Article A, and it was also viewed more times than Article A.
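These figures assume p is initialized with the timestamp of the first view; under that assumption they can be reproduced as follows:

```python
def vote(p, t):
    return (p + t) / 2

def popularity(view_times):
    """Fold views into a single value, assuming p is initialized
    with the timestamp of the first view (an assumption; the
    outcome depends on the chosen initialization)."""
    p = view_times[0]
    for t in view_times[1:]:
        p = vote(p, t)
    return p

# popularity([70, 80, 90]) == 82.5
# popularity([50, 60, 70, 80, 90]) == 80.625
```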

Kamin answered 27/4, 2018 at 4:13 Comment(1)
In your example, you would expect the popularity of A and B to be equal as t → ∞, which is exactly what happens. However, given the initial condition that Article A is newer than Article B, Article A will always have a slight edge over Article B. As discussed, this algorithm favors new content over old content. I think this actually represents real-life situations quite well—newer articles tend to be more relevant compared to older articles, so all other things being equal, the newer article should win. Remember, the goal of this algorithm is not to be the best, but rather the simplest.Climb
