What's the difference between item-based and content-based collaborative filtering?

Asked 4/5, 2013 at 8:22 Answered 8/5, 2013 at 18:35

Solved mahout recommendation-engine mahout-recommender

I am puzzled about what item-based recommendation is, as described in the book "Mahout in Action". The book gives this algorithm:

for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average

How can I calculate the similarity between items? If I use the items' content, isn't that content-based recommendation?

Sivie answered 4/5, 2013 at 8:22 Comment(0)

117

Item-Based Collaborative Filtering

The original item-based recommendation is based entirely on user-item ratings (e.g., a user rated a movie 3 stars, or a user "likes" a video). When you compute the similarity between items, you are not supposed to know anything other than all users' rating history. So the similarity between items is computed from the ratings, not from the metadata of the item content.

Let me give you an example. Suppose you only have access to some rating data like the following:

user 1 likes: movie, cooking
user 2 likes: movie, biking, hiking
user 3 likes: biking, cooking
user 4 likes: hiking

Suppose now you want to make recommendations for user 4.

First you create an inverted index for the items, which gives:

movie:     user 1, user 2
cooking:   user 1, user 3
biking:    user 2, user 3
hiking:    user 2, user 4

Since this is a binary rating (like or not), we can use a similarity measure like Jaccard Similarity to compute item similarity.

                                 |user1|
similarity(movie, cooking) = --------------- = 1/3
                               |user1,2,3|

In the numerator, user1 is the only user that movie and cooking have in common. In the denominator, the union of movie and cooking has 3 distinct users (user1,2,3). |.| here denotes the size of the set. So the similarity between movie and cooking is 1/3 in our case. You do the same thing for all possible item pairs (i,j).
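The inverted index and the Jaccard computation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Mahout code; the names `likes`, `build_inverted_index` and `jaccard` are made up for this example.

```python
from itertools import combinations

# Likes taken from the example above.
likes = {
    "user 1": {"movie", "cooking"},
    "user 2": {"movie", "biking", "hiking"},
    "user 3": {"biking", "cooking"},
    "user 4": {"hiking"},
}

def build_inverted_index(likes):
    """Map each item to the set of users who like it."""
    index = {}
    for user, items in likes.items():
        for item in items:
            index.setdefault(item, set()).add(user)
    return index

def jaccard(a, b):
    """|a intersect b| / |a union b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

index = build_inverted_index(likes)

# Similarity for every unordered item pair (i, j).
similarity = {
    frozenset((i, j)): jaccard(index[i], index[j])
    for i, j in combinations(sorted(index), 2)
}

print(similarity[frozenset(("movie", "cooking"))])  # about 0.333, i.e. 1/3
```

Note that only the users' likes go into the computation; nothing about what a "movie" or "cooking" actually is.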

Once you have computed the similarity for all pairs, suppose you need to make a recommendation for user 4.

Since user 4 only likes hiking, look at similarity(hiking, x), where x is any other item, and recommend the items with the highest scores.

If you need to make a recommendation for user 3, you can aggregate the similarity scores from each item in that user's list. For example,

score(movie)  = Similarity(biking, movie) + Similarity(cooking, movie)
score(hiking) = Similarity(biking, hiking) + Similarity(cooking, hiking)
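As a rough sketch, the aggregation for user 3 could look like this in Python. The helper names (`users_who_like`, `recommend`) are hypothetical, and summing the similarities is just the aggregation used in this answer; other schemes such as a weighted average are possible.

```python
# Likes taken from the example above.
likes = {
    "user 1": {"movie", "cooking"},
    "user 2": {"movie", "biking", "hiking"},
    "user 3": {"biking", "cooking"},
    "user 4": {"hiking"},
}

def jaccard(a, b):
    """|a intersect b| / |a union b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def users_who_like(item, likes):
    """Set of users whose like-list contains the item."""
    return {user for user, items in likes.items() if item in items}

def recommend(user, likes):
    """Score each item the user has not liked yet by summing its
    similarity to every item the user already likes."""
    all_items = set().union(*likes.values())
    liked = likes[user]
    scores = {}
    for candidate in all_items - liked:
        scores[candidate] = sum(
            jaccard(users_who_like(candidate, likes), users_who_like(j, likes))
            for j in liked
        )
    # Highest score first; ties broken alphabetically so the order is stable.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

print(recommend("user 3", likes))  # movie (about 2/3) ranks above hiking (about 1/3)
```

For user 3, movie gets 1/3 + 1/3 and hiking gets 1/3 + 0, so movie would be recommended first.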

Content-Based Recommendation

The point of content-based recommendation is that we have to know the content of both the user and the item. Usually you construct a user profile and an item profile over a shared attribute space. For example, you can represent a movie by the movie stars in it and its genres (using a binary encoding, for example). You can build the user profile the same way, based on which movie stars, genres, etc. the user likes. The similarity between a user and an item can then be computed using, e.g., cosine similarity.

Here is a concrete example:

Suppose this is our user profile (using binary encoding, where 0 means not-like and 1 means like), which contains each user's preferences over 5 movie stars and 5 movie genres:

         Movie stars 0 - 4    Movie Genres
user 1:    0 0 0 1 1          1 1 1 0 0
user 2:    1 1 0 0 0          0 0 0 1 1
user 3:    0 0 0 1 1          1 1 1 1 0

Suppose this is our movie profile:

         Movie stars 0 - 4    Movie Genres
movie1:    0 0 0 0 1          1 1 0 0 0
movie2:    1 1 1 0 0          0 0 1 0 1
movie3:    0 0 1 0 1          1 0 1 0 1

To calculate how well a movie matches a user, we use cosine similarity:

                                 dot-product(user1, movie1)
similarity(user 1, movie1) = --------------------------------- 
                                   ||user1|| x ||movie1||

                              0x0+0x0+0x0+1x0+1x1+1x1+1x1+1x0+0x0+0x0
                           = -----------------------------------------
                                         sqrt(5) x sqrt(3)

                           = 3 / (sqrt(5) x sqrt(3)) = 0.77460

Similarly:

similarity(user 2, movie2) = 3 / (sqrt(4) x sqrt(5)) = 0.67082 
similarity(user 3, movie3) = 3 / (sqrt(6) x sqrt(5)) = 0.54772

If you want to give one recommendation to user i, pick the movie j that has the highest similarity(i, j).
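The same numbers can be reproduced with a short Python sketch. The vectors are the profiles above, written as flat lists (5 movie stars followed by 5 genres); `cosine` and `best_movie` are hypothetical helper names for this example.

```python
import math

# Profiles from the example: 5 movie-star flags followed by 5 genre flags.
users = {
    "user 1": [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
    "user 2": [1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
    "user 3": [0, 0, 0, 1, 1, 1, 1, 1, 1, 0],
}
movies = {
    "movie1": [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "movie2": [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
    "movie3": [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
}

def cosine(u, v):
    """Dot product divided by the product of the vector norms
    (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_movie(user):
    """Pick the movie whose profile is most similar to the user's profile."""
    return max(movies, key=lambda m: cosine(users[user], movies[m]))

print(round(cosine(users["user 1"], movies["movie1"]), 5))  # 0.7746
print(best_movie("user 1"))  # movie1
```

Here the similarity is computed purely from the attribute vectors; no other user's ratings are involved, which is what separates this from the item-based approach.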

Grantee answered 8/5, 2013 at 18:35 Comment(2)

Cooking has 2 distinct users? – Trautman 6/5, 2014 at 4:38

Great answer! My understanding: I think we can say that the historical item-item approach uses the exact same data as user-user: rankings. User-user finds "users that rated the same products as you" (and recommends products loved by those similar users). Item-item computes "products that have the same ratings as those you already bought" (and recommends those products directly). Same data, different approach. The content-based approach involves intrinsic item attributes as well. And hybrid is just mixing all that. – Profluent 19/10, 2022 at 12:23

"Item-based" really means "item-similarity-based". You can put whatever similarity metric you like in here. Yes, if it's based on content, like a cosine similarity over term vectors, you could also call this "content-based".
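To make the "plug in any similarity" point concrete, here is a minimal Python sketch of the weighted-average loop from the question, with the similarity passed in as a function. This is not Mahout's implementation. The data (`prefs`, `tags`), the user and item names, and the choice to skip non-positive similarities are all assumptions made for illustration.

```python
def jaccard(a, b):
    """|a intersect b| / |a union b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical ratings (user -> item -> preference) and item tags.
prefs = {
    "alice": {"A": 5.0, "B": 3.0},
    "bob":   {"A": 4.0, "C": 2.0},
    "carol": {"B": 4.0, "C": 5.0},
}
tags = {"A": {"action", "comedy"}, "B": {"drama"}, "C": {"action"}}

def rating_similarity(i, j):
    """Collaborative: compare the sets of users who rated each item."""
    raters_i = {u for u, r in prefs.items() if i in r}
    raters_j = {u for u, r in prefs.items() if j in r}
    return jaccard(raters_i, raters_j)

def content_similarity(i, j):
    """Content-based: compare the items' own attributes."""
    return jaccard(tags[i], tags[j])

def recommend(user, prefs, similarity, top_n=3):
    """For every item the user has no preference for, average the user's
    known preferences weighted by item-item similarity."""
    all_items = {item for ratings in prefs.values() for item in ratings}
    rated = prefs[user]
    estimates = {}
    for i in all_items - set(rated):
        weighted_sum = 0.0
        weight_total = 0.0
        for j, pref in rated.items():
            s = similarity(i, j)
            if s <= 0:  # assumption: ignore items with no positive similarity
                continue
            weighted_sum += s * pref
            weight_total += s
        if weight_total > 0:
            estimates[i] = weighted_sum / weight_total
    return sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

print(recommend("alice", prefs, rating_similarity))
print(recommend("alice", prefs, content_similarity))
```

The loop is identical in both calls; only the similarity function changes, and with it whether you would describe the result as collaborative or content-based.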

Sanguinary answered 4/5, 2013 at 8:56 Comment(7)

It's a great honor to get your answer. To compare the effect of the two recommendation methods, I use the RMSRecommenderEvaluator. Even with the same parameters, it can't guarantee the same training and evaluation data. What can I do to compare them on the same data? – Sivie 4/5, 2013 at 9:25

You mean because the random training set is different? Try calling RandomUtils.useTestSeed() before anything else executes. – Sanguinary 4/5, 2013 at 10:36

But I want to run several test cases, and I want the results to differ. – Sivie 4/5, 2013 at 12:38

I think you will have to hack the code a bit to save and then reuse the same training set. But it's probably just as good to run the random tests many times and compare means. – Sanguinary 4/5, 2013 at 13:15

Yes, I ran RecommenderEvaluator several times and sorted the results. That's what I expected to get. But why not design an API to change the STANDARD_SEED in RandomWrapper, and thus change the random utility? – Sivie 4/5, 2013 at 14:11

I don't understand the question. You should write to the mailing list and ask those who are still working on this old code. – Sanguinary 4/5, 2013 at 23:11

I mean just changing the seed that RandomUtils.useTestSeed() uses by calling an API. – Sivie 5/5, 2013 at 1:37
