Deciding and implementing a trending algorithm in Django
I have a Django application in which I need to implement a simple trending/ranking algorithm. I'm very lost as to how to go about this:

I have two models, Book and Reader. Every night, new books are added to my database. The number of readers for each book is also updated every night, i.e. one book will have multiple reader statistic records (one record for each day).

Over a given period (past week, past month or past year), I would like to list the most popular books. What algorithm should I use for this?

The popularity doesn't need to be real-time in any way because the reader count for each book is only updated daily.

I found one article, referenced in another SO post, that showed how trending Wikipedia articles were calculated, but the post only showed how the current trend was calculated.

As someone pointed out on SO, it is a very simple baseline trend algorithm that only calculates the slope between two data points, so I guess it shows the trend between yesterday and today.

I'm not looking for an uber-complex trending algorithm like those used on Hacker News, Reddit, etc.

I have only two data axes: reader count and date.

Any ideas on what I should implement and how? For someone who's never worked with anything statistics/algorithm-related, this seems like a very daunting undertaking.

Thanks in advance everyone.

Glossolalia answered 14/2, 2012 at 20:39 Comment(0)

Probably the simplest possible trending "algorithm" I can think of is the n-day moving average. I'm not sure how your data is structured, but say you have something like this:

books = {'Twilight': [500, 555, 580, 577, 523, 533, 556, 593],
         'Harry Potter': [650, 647, 653, 642, 633, 621, 625, 613],
         'Structure and Interpretation of Computer Programs': [1, 4, 15, 12, 7, 3, 8, 19]
        }

A simple moving average just takes the last n values and averages them:

def moving_av(l, n):
    """Return the average of the last n elements of the list l."""
    tail = l[-n:]  # the slice is safe even when n exceeds len(l)
    return sum(tail) / float(len(tail))

The slice notation simply grabs the tail end of the list, starting from the nth-to-last element. A moving average is a fairly standard way to smooth out any noise that a single spike or dip could introduce. The function could be used like so:

book_scores = {}
for book, reader_list in books.items():
    book_scores[book] = moving_av(reader_list, 5)

You'll want to play around with the number of days you average over. And if you want to emphasize recent trends you can also look at using something like a weighted moving average.
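If you want a concrete starting point, here is a minimal sketch of an exponentially weighted moving average (the EWMA discussed in the comments below). The smoothing factor alpha is a free parameter you choose, not something computed from the data:

def ewma(l, alpha):
    """Exponentially weighted moving average of a non-empty list of daily counts.

    alpha must be in (0, 1]; values closer to 1 discount older
    observations faster.
    """
    avg = l[0]
    for observation in l[1:]:
        avg = alpha * observation + (1 - alpha) * avg
    return avg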

If you want to focus less on absolute readership and more on increases in readership, simply find the percent change between the 30-day moving average and the 5-day moving average:

d5_moving_av = moving_av(reader_list, 5)
d30_moving_av = moving_av(reader_list, 30)
book_score = (d5_moving_av - d30_moving_av) / d30_moving_av

With these simple tools you have a fair amount of flexibility in how much you emphasize past trends and how much you want to smooth out (or not smooth out) spikes.
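Putting those pieces together, a hypothetical scoring loop over the sample data above might look like this (with only eight days of sample data, the "30-day" average simply degrades to an all-time average):

book_scores = {}
for book, reader_list in books.items():
    d5 = moving_av(reader_list, 5)
    d30 = moving_av(reader_list, 30)
    book_scores[book] = (d5 - d30) / d30

# Rank books by their trending score, highest first.
for book in sorted(book_scores, key=book_scores.get, reverse=True):
    print(book, book_scores[book])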

Deutzia answered 14/2, 2012 at 21:1 Comment(5)
Hi Wilduck, I've been looking into the EWMA calculation that you prescribed. That seems like a good fit for my issue. I'm confused as to how to calculate the value of alpha (α). Do you have any ideas how I can calculate this?Glossolalia
@MridangAgarwalla Good news! You don't have to calculate it! You can pick any number between zero and one, where a number closer to one discounts older observations faster. Your choice will depend on how much you want to discount older values, so you can play around with it until you find something you like.Deutzia
That being said, I think a simple moving average (one that isn't exponentially weighted) might work just as well for your purposes. I would suggest implementing the simpler version first, and then swapping in the exponentially weighted version if you find it isn't satisfactory.Deutzia
Hi Wilduck, I've been working on this. Is there a tool into which I could load my delimited data and play around with different functions? Something along the lines of a statistical scratchpad. Thanks.Glossolalia
Well, the simplest answer would be to just use the csv module. If you're looking for something a little more fully featured, there's the pandas library (pandas.pydata.org). Using pandas and IPython (ipython.org) will give you a fully featured scientific computing environment.Deutzia

Popularity is easy; you just run a count on the readers and order by that:

from django.db.models import Count

Book.objects.annotate(reader_count=Count('readers')).order_by('-reader_count')

Trending is more difficult, as this is more a popularity delta, i.e. which books have gained the most readers recently. If you want something like this, you'll need something running behind the scenes to keep a record of reader counts by date.
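One hypothetical shape for that record, assuming a nightly job populates it (the model and field names here are illustrative, not anything from your existing schema):

from django.db import models

class DailyReaderCount(models.Model):
    # One row per book per day, written by a nightly job.
    book = models.ForeignKey('Book', on_delete=models.CASCADE)
    date = models.DateField()
    total = models.PositiveIntegerField()

    class Meta:
        unique_together = ('book', 'date')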

Adventurer answered 14/2, 2012 at 20:59 Comment(0)

You can take the Stack Overflow reputation ranking as an example.

Users can change the view: by month, by year, etc.

In your case: the most-read books by month or by year.

To achieve this you should save the number of readers for each book day by day.

reader( date, book, total )

Then it is as simple as:

from django.db.models import Sum

Book.objects.filter(
    reader__date__gte=some_date
).annotate(
    num_readers=Sum('reader__total')
).order_by('-num_readers')
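Here some_date marks the start of the window; for example, for the past month it might be computed like this:

from datetime import date, timedelta

some_date = date.today() - timedelta(days=30)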
Siana answered 14/2, 2012 at 21:0 Comment(4)
Never do this. It is the easiest way to kill a SQL server.Shadchan
@iddqd, you are a bit apocalyptic. Please link some resource that explains your claim.Siana
Aggregate functions are very slow, and full scans are very slow. Aggregate functions plus full scans are very, very slow. To produce an all-time ranking you need to read all the data.Shadchan
OP can use this query to populate Book fields and then work with those fields. Do you agree?Siana

I would do it systematically like this:

  1. Make a list of the most common questions or data points a user will be interested in, for example:
     1.1 Top 100 most popular books this week
     1.2 Top 100 most popular books this month

  2. After your daily reader/book info is updated, I would run a job (probably nightly) to update a table of this info. The table will probably have Book and ReaderDelta fields, where ReaderDelta is the change in readerCount over a week, month or year.

  3. You could also simply store the daily ReaderDelta and, when looking for a month's worth of data, aggregate the past 30 days by date dynamically (see the sketch below).
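A minimal sketch of that aggregation, assuming a hypothetical ReaderDelta model with book, date and delta fields (none of these names come from your schema):

from datetime import date, timedelta
from django.db.models import Sum

def top_books(days=30, limit=100):
    # Sum each book's daily reader deltas over the window
    # and return the books with the largest gains.
    since = date.today() - timedelta(days=days)
    return (ReaderDelta.objects
            .filter(date__gte=since)
            .values('book')
            .annotate(gain=Sum('delta'))
            .order_by('-gain')[:limit])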

Haworth answered 14/2, 2012 at 21:22 Comment(0)
