Spark - How to use the trained recommender model in production?

I am using Spark to build a recommendation system prototype. After going through some tutorials, I have been able to train a MatrixFactorizationModel from my data.

However, the model trained by Spark MLlib is only Serializable. How can I use this model to make recommendations for real users? I mean, how can I persist the model to some sort of database, or update it as new user data arrives?

For example, a model trained by the Mahout recommendation library can be stored in a database like Redis, and the recommended item list queried later. How can we do something similar in Spark? Any suggestions?

Historiated answered 7/1, 2015 at 8:23

First, the "model" you're referring to from Mahout is not a model, but a pre-computed list of recommendations. You could also do this with Spark, and compute in batch recommendations for users, and persist them anywhere you like. This has nothing to do with serializing a model. If you don't want to do real-time updates or scoring, you can stop there and just use Spark for batch just like you do Mahout.

But I agree that in many cases you do want to ship the model somewhere else and serve it. As you can see, other models in Spark are Serializable, but MatrixFactorizationModel is not. (Yes, even though it's marked as such, it won't serialize.) Likewise, there is a standard serialization format for predictive models, PMML, but it contains no vocabulary for a factored matrix model.

The reason is actually the same. Whereas many predictive models, like an SVM or logistic regression model, are just a small set of coefficients, a factored matrix model is huge, containing two matrices with potentially billions of elements. That is why I think PMML doesn't have any reasonable encoding for it.

Likewise, in Spark, that means the actual matrices are RDDs that can't be serialized directly. You can persist these RDDs to storage, re-read them elsewhere using Spark, and recreate a MatrixFactorizationModel by hand that way.
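As a sketch of that (paths are placeholders, and this assumes a Spark version in which the MatrixFactorizationModel constructor is public, as it is in recent releases):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Persist the two factor matrices as Hadoop object files.
def saveFactors(model: MatrixFactorizationModel): Unit = {
  model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
  model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")
}

// Re-read them elsewhere and rebuild the model "by hand".
def loadFactors(sc: SparkContext, rank: Int): MatrixFactorizationModel = {
  val userFeatures =
    sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
  val productFeatures =
    sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")
  new MatrixFactorizationModel(rank, userFeatures, productFeatures)
}
```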

You can't serve or update the model with Spark alone, though. For that you are really looking at writing some code to perform updates and calculate recommendations on the fly.
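To make that concrete, here is a hypothetical in-memory scorer, with the factor matrices loaded (e.g. from the files saved above) into plain maps. It's a sketch of the idea, not an official Spark serving API:

```scala
// Scores items for a user as dot products of the ALS factor vectors.
class InMemoryRecommender(userFactors: Map[Int, Array[Double]],
                          itemFactors: Map[Int, Array[Double]]) {

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  /** Top-n items for a known user, ranked by userVector . itemVector. */
  def recommend(userId: Int, n: Int): Seq[(Int, Double)] =
    userFactors.get(userId) match {
      case Some(u) =>
        itemFactors.toSeq
          .map { case (itemId, v) => (itemId, dot(u, v)) }
          .sortBy(-_._2)
          .take(n)
      case None => Seq.empty // new users need separate fold-in logic
    }
}
```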

I don't mind suggesting the Oryx project here, since its point is to manage exactly this aspect, particularly for ALS recommendation. In fact, the Oryx 2 project is based on Spark and, although in alpha, already contains the complete pipeline to serialize and serve the output of MatrixFactorizationModel. I don't know whether it meets your needs, but it may at least be an interesting reference point.

Telltale answered 7/1, 2015 at 9:46
Thanks for the excellent and detailed explanation! I will try Oryx :) (Historiated)

Another method for creating recs with Spark is the search-engine method: basically a cooccurrence recommender served by Solr or Elasticsearch. Comparing factorization to cooccurrence is beyond the scope of this question, so I'll just describe the latter.

You feed interactions (user-id, item-id) into Mahout's spark-itemsimilarity. This produces a list of similar items for every item seen in the interaction data. By default it comes out as CSV, so it can be stored anywhere, but it needs to be indexed by a search engine.
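A hedged example invocation (flag names per the Mahout 0.10 docs; verify against `mahout spark-itemsimilarity --help` for your version, and treat the paths and master URL as placeholders):

```bash
mahout spark-itemsimilarity \
  --input hdfs:///data/interactions.csv \
  --output hdfs:///data/indicators \
  --master spark://your-master:7077
```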

In any case, when you want to fetch recs you use the user's history as the query, and you get back an ordered list of items as recs.

One benefit of this method is that indicators can be calculated for as many user actions as you want. Any action the user takes that correlates with what you want to recommend can be used. For instance, suppose you want to recommend purchases but you also record product views. If you treated product views the same as purchases you would likely get worse recs (I've tried it). However, if you calculate one indicator for purchases and another (actually a cross-cooccurrence) indicator for product views, the two are equally predictive of purchases. This has the effect of increasing the data used for recs. The same kind of thing can be done with user locations to blend location information into purchase recs.

You can also bias your recs based on context. If the user is in the "electronics" section of a catalog, you may want recs skewed towards electronics. Add "electronics" to the query against the item's "category" metadata field, give it a boost, and you have biased recs.

Since all of the biasing and mixing of indicators happens in the query, the recs engine is easily tuned to multiple contexts while maintaining only one multi-field query made through a search engine; see the sketch below. Scalability comes from Solr or Elasticsearch.
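For illustration only, such a query might look like the following in Elasticsearch's query DSL. The field names (purchase_indicators, view_indicators, category) and item IDs are hypothetical; the real schema is whatever you indexed the indicator output under:

```json
{
  "query": {
    "bool": {
      "should": [
        { "terms": { "purchase_indicators": ["iphone", "ipad"] } },
        { "terms": { "view_indicators": ["ipad-case", "hdmi-cable"] } },
        { "terms": { "category": ["electronics"], "boost": 2.0 } }
      ]
    }
  }
}
```

The first two clauses carry the user's purchase and view history against the corresponding indicator fields; the boosted category clause skews results towards the current context.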

One other benefit of either factorization or the search method is that entirely new users and new history can be used to create recs, whereas the older Mahout recommenders could only recommend to users and interactions known when the job was run.

Descriptions here:

Fairing answered 7/1, 2015 at 18:31

You should run model.predictAll() (in PySpark; in Scala it's model.predict() over an RDD of (user, product) pairs) on a reduced set of (user, product) pairs, as in the Mahout Hadoop job linked below, and store the results for online usage...

https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
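A minimal Scala sketch of that idea, with candidate generation and the output path left as placeholder assumptions:

```scala
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

// Score a pre-built candidate set and keep the top 10 items per user,
// persisting the result for online lookup. model.predict on an RDD of
// (user, product) pairs is the Scala counterpart of PySpark's predictAll().
def batchScore(model: MatrixFactorizationModel,
               candidates: RDD[(Int, Int)]): Unit = {
  model.predict(candidates)
    .groupBy(_.user)
    .mapValues(_.toSeq.sortBy(-_.rating).take(10))
    .saveAsObjectFile("hdfs:///data/topn-per-user") // placeholder path
}
```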

Discomfit answered 19/1, 2015 at 21:07

You can use model.save(sparkContext, outputFolder) to save the model to a folder of your choice. When serving recommendations in real time, you just have to use MatrixFactorizationModel.load(sparkContext, modelFolder) to load it back as a MatrixFactorizationModel object.
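A short sketch of that built-in persistence (available for MatrixFactorizationModel since Spark 1.3; the path is a placeholder):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Save the trained model (writes the factor matrices plus metadata)...
def persist(sc: SparkContext, model: MatrixFactorizationModel): Unit =
  model.save(sc, "hdfs:///models/als-model")

// ...and load it back in the serving process.
def restore(sc: SparkContext): MatrixFactorizationModel =
  MatrixFactorizationModel.load(sc, "hdfs:///models/als-model")
```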

A question to @Sean Owen: doesn't the MatrixFactorizationModel contain the factorization matrices (the user-feature and item-feature matrices) rather than recommendations/predicted ratings?

Animal answered 31/12, 2015 at 6:39
