How to score all user-product combinations in Spark MatrixFactorizationModel?
Asked Answered

Given a MatrixFactorizationModel what would be the most efficient way to return the full matrix of user-product predictions (in practice, filtered by some threshold to maintain sparsity)?

Via the current API, one could pass a Cartesian product of users and products to the predict function, but it seems to me that this will do a lot of extra processing.

Would accessing the private userFeatures, productFeatures be the correct approach, and if so, is there a good way to take advantage of other aspects of the framework to distribute this computation in an efficient way? Specifically, is there an easy way to do better than multiplying all pairs of userFeature, productFeature "by hand"?

Sharyl answered 12/10, 2014 at 15:21 Comment(0)

Spark 1.1 has a recommendProducts method that can be mapped to each user ID. This is better than nothing but not really optimized for recommending to all users.

I would double-check that you really mean to make recommendations for everyone; at scale, this is inherently a big, slow operation. Consider predicting only for users who have been recently active.

Otherwise, yes, your best bet is to create your own method. A Cartesian join of the feature RDDs is probably too slow, because it shuffles so many copies of the feature vectors. Instead, map over the larger of the user/product feature sets, and hold the other (smaller) feature set in memory on each worker. If the smaller set doesn't fit in memory, you can make this more complex and map several times against in-memory subsets of the smaller RDD.
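The broadcast-and-map idea above boils down to a blocked matrix product between the user and product factor matrices, keeping only scores above a threshold to preserve sparsity. Here is a minimal local sketch in NumPy (the function name, block size, and threshold are illustrative assumptions, not part of MLlib's API); in Spark, the smaller feature matrix would be broadcast and each worker would run the block loop over its partition of the larger one:

```python
import numpy as np

def score_all_pairs(user_feats, prod_feats, threshold):
    """Score every user-product pair as the dot product of factor vectors,
    keeping only (user_idx, prod_idx, score) triples above `threshold`.
    Processing users in blocks bounds peak memory, mirroring per-partition
    work in a Spark job with `prod_feats` broadcast to every worker."""
    results = []
    block_size = 2  # tiny block for illustration; tune for real data
    for start in range(0, user_feats.shape[0], block_size):
        block = user_feats[start:start + block_size]
        scores = block @ prod_feats.T          # (block_users, n_products)
        users, prods = np.nonzero(scores > threshold)
        for u, p in zip(users, prods):
            results.append((start + u, p, scores[u, p]))
    return results

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 4))   # 6 users, rank-4 factors
P = rng.normal(size=(5, 4))   # 5 products, rank-4 factors
sparse_scores = score_all_pairs(U, P, threshold=1.0)
```

The blocked loop produces exactly the entries of the full `U @ P.T` matrix that exceed the threshold, without ever materializing the dense matrix for all users at once.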

Model answered 12/10, 2014 at 15:52 Comment(1)
Yes, it is a big operation, which is why it seemed worth the effort to try to optimize further. Thanks for the suggestions! – Sharyl

As of Spark 2.2, recommendProductsForUsers(num) would be the method.

Recommends the top "num" number of products for all users. The number of recommendations returned per user may be less than "num".

https://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html
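Conceptually, recommendProductsForUsers(num) computes, for each user, the top-num products by predicted score. The same top-k selection over the factor product can be sketched locally with NumPy (the function name and shapes here are illustrative, not the MLlib API):

```python
import numpy as np

def top_num_products(user_feats, prod_feats, num):
    """For each user row, return the indices of the `num` highest-scoring
    products in descending score order, mirroring the shape of what
    recommendProductsForUsers(num) returns (at most `num` per user)."""
    scores = user_feats @ prod_feats.T                     # (n_users, n_products)
    k = min(num, scores.shape[1])
    # argpartition finds the k largest per row without a full sort...
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # ...then sort just those k candidates per user by descending score
    order = np.argsort(-np.take_along_axis(scores, top, axis=1), axis=1)
    return np.take_along_axis(top, order, axis=1)

rng = np.random.default_rng(1)
U = rng.normal(size=(4, 3))   # 4 users, rank-3 factors
P = rng.normal(size=(6, 3))   # 6 products
recs = top_num_products(U, P, num=2)  # shape (4, 2): 2 product indices per user
```

Using argpartition before the small sort keeps the cost near O(n_products) per user rather than a full O(n_products log n_products) sort, which matters when the product catalog is large.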

Dandle answered 13/12, 2017 at 0:17 Comment(0)
