Incremental training of ALS model

I'm trying to find out if it is possible to have "incremental training" on data using MLlib in Apache Spark.

My platform is Prediction IO, and it's basically a wrapper around Spark (MLlib), HBase, Elasticsearch and some other RESTful parts.

In my app, data "events" are inserted in real time, but to get updated prediction results I need to run "pio train" and "pio deploy". This takes some time, and the server goes offline during the redeploy.

I'm trying to figure out if I can do incremental training during the "predict" phase, but cannot find an answer.

Swear answered 1/1, 2015 at 20:21 Comment(4)
Does PIO support Spark Streaming and augmenting existing prediction results from the StreamRDDs? – Nyhagen
I just checked: online/incremental training has been implemented for streaming linear regression and streaming clustering. Unfortunately there is no streaming collaborative filtering (ALS), nor other streaming classification/regression methods yet. – Marroquin
Streaming k-means – Farmyard
See here for a possible solution: #41537970 – Unformed

I imagine you are using Spark MLlib's ALS model, which performs matrix factorization. The result of the model is two matrices: a user-features matrix and an item-features matrix.
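
For context, this is roughly what that looks like with MLlib's RDD-based API (a minimal sketch with toy data; the ratings and hyperparameters are just placeholders):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// sc: SparkContext assumed to be available (e.g. in spark-shell)
def trainAls(sc: SparkContext): Unit = {
  // Toy explicit ratings: Rating(userId, itemId, rating)
  val ratings = sc.parallelize(Seq(
    Rating(1, 101, 4.0), Rating(1, 102, 2.0),
    Rating(2, 101, 5.0), Rating(2, 103, 3.0)
  ))

  // rank = 5 latent features, 10 iterations, regularization 0.01
  val model = ALS.train(ratings, 5, 10, 0.01)

  // The two matrices the factorization produces:
  val userFeatures = model.userFeatures      // RDD[(Int, Array[Double])]
  val itemFeatures = model.productFeatures   // RDD[(Int, Array[Double])]
}
```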

Assuming we receive a stream of data with ratings (or transactions, in the implicit case), a truly (100%) online update of this model would update both matrices for each new piece of rating information by triggering a full retrain of the ALS model on the entire previous data plus the new rating. In this scenario we are limited by the fact that running the entire ALS model is computationally expensive, and the incoming stream of data could be frequent, so it would trigger a full retrain too often.
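
To make the cost concrete, a naive sketch of that "full retrain per micro-batch" approach with Spark Streaming would look something like this (names and hyperparameters are illustrative):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// historical: all ratings seen so far; newRatings: the incoming stream
def naiveFullRetrain(historical: RDD[Rating], newRatings: DStream[Rating]): Unit = {
  var allRatings = historical
  newRatings.foreachRDD { batch =>
    // Append the new batch to the full history...
    allRatings = allRatings.union(batch).cache()
    // ...and retrain the whole model. This is exactly the expensive part:
    // a complete ALS run over all data for every micro-batch.
    val model = ALS.train(allRatings, 10, 10, 0.01)
    // the retrained model would then be swapped into serving here
  }
}
```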

Knowing this, we can look for alternatives: a single rating should not change the matrices much, and we have optimization approaches that are incremental, for example SGD. There is an interesting (still experimental) library, written for the explicit-ratings case, that does incremental updates for each batch of a DStream:

https://github.com/brkyvz/streaming-matrix-factorization

The idea of using an incremental approach such as SGD follows from the fact that, as long as one moves along the gradient (of a minimization problem), one is guaranteed to be moving towards a minimum of the error function. So even if we apply the update for a single new rating only to the row of the user-features matrix for this specific user and the row of the item-features matrix for this specific rated item, as long as the update follows the gradient we still move towards the minimum, approximately of course, but still towards the minimum.
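
As a sketch of that single-rating update (this is the textbook SGD step for matrix factorization, not code from the library above), note that only the affected user row and item row are touched:

```scala
// One SGD step for a single observed rating r(u, i).
// userVec is p_u (row of the user-features matrix),
// itemVec is q_i (row of the item-features matrix).
def sgdStep(
    userVec: Array[Double],
    itemVec: Array[Double],
    rating: Double,
    stepSize: Double,
    lambda: Double): Unit = {
  // prediction = p_u . q_i
  val pred = userVec.zip(itemVec).map { case (p, q) => p * q }.sum
  val err = rating - pred
  // Gradient step with L2 regularization:
  //   p_u += stepSize * (err * q_i - lambda * p_u)
  //   q_i += stepSize * (err * p_u - lambda * q_i)
  for (k <- userVec.indices) {
    val pu = userVec(k)
    val qi = itemVec(k)
    userVec(k) = pu + stepSize * (err * qi - lambda * pu)
    itemVec(k) = qi + stepSize * (err * pu - lambda * qi)
  }
}
```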

The other problem comes from Spark itself and the distributed setting: ideally the updates should be applied sequentially, one per incoming rating, but Spark treats the incoming stream as a batch, distributed as an RDD, so the update operations are performed for the entire batch with no guarantee of sequentiality.

In more detail, if you are using Prediction.IO for example, you could do offline training using the regular train and deploy functions built in, but if you want online updates you would have to access both matrices for each batch of the stream, run updates using SGD, and then ask for the new model to be deployed. This functionality is of course not in Prediction.IO; you would have to build it on your own.
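
A very rough sketch of what such a hand-built loop could look like (all names here are illustrative, it reuses the sgdStep sketch above, and collecting the factor matrices to the driver is only feasible for small models):

```scala
import scala.collection.mutable
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.streaming.dstream.DStream

def onlineUpdates(sc: SparkContext,
                  initial: MatrixFactorizationModel,
                  stream: DStream[Rating]): Unit = {
  // Pull both factor matrices to the driver (only feasible for small models).
  val userF = mutable.Map(initial.userFeatures.collect(): _*)
  val itemF = mutable.Map(initial.productFeatures.collect(): _*)
  val rank = initial.rank

  stream.foreachRDD { batch =>
    // Apply one SGD step per rating; within a batch there is no
    // guarantee that this order matches the arrival order.
    batch.collect().foreach { r =>
      val pu = userF.getOrElseUpdate(r.user, Array.fill(rank)(0.1))
      val qi = itemF.getOrElseUpdate(r.product, Array.fill(rank)(0.1))
      sgdStep(pu, qi, r.rating, stepSize = 0.01, lambda = 0.1)
    }
    // Rebuild a model from the updated matrices and hand it to serving.
    val updated = new MatrixFactorizationModel(
      rank,
      sc.parallelize(userF.toSeq),
      sc.parallelize(itemF.toSeq))
    // deploy(updated)  // hypothetical hook into your serving layer
  }
}
```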

Interesting notes for SGD updates:

http://stanford.edu/~rezab/classes/cme323/S15/notes/lec14.pdf

Jeannettejeannie answered 28/4, 2016 at 14:36 Comment(1)
This is a very plausible answer! +1 – Hooker

You can update your model near-online (I write "near" because, let's face it, true online updating is impossible) by using the fold-in technique, see e.g.: Online-Updating Regularized Kernel Matrix Factorization Models for Large-Scale Recommender Systems.
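
As a sketch of the fold-in idea (in the spirit of that approach, not the paper's exact method): with the item-features matrix kept fixed, a new user's factor vector can be obtained from their few known ratings by a small regularized least-squares solve, without retraining anything else. This uses Breeze, which ships with Spark; the names are illustrative:

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// Fold in a new user: solve (Q^T Q + lambda*I) p_u = Q^T r,
// where Q stacks the factor vectors of the items the user has rated
// and r holds the corresponding ratings.
def foldInUser(ratedItemFactors: Seq[Array[Double]],
               ratings: Seq[Double],
               lambda: Double): Array[Double] = {
  val rank = ratedItemFactors.head.length
  val n = ratings.length

  // Build Q with one row per rated item.
  val q = DenseMatrix.zeros[Double](n, rank)
  for (row <- 0 until n; col <- 0 until rank)
    q(row, col) = ratedItemFactors(row)(col)
  val r = DenseVector(ratings.toArray)

  val a = q.t * q + DenseMatrix.eye[Double](rank) * lambda
  val b = q.t * r
  (a \ b).toArray   // the new user's factor vector p_u
}
```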

Or you can look at the code of:

  • MyMediaLite
  • Oryx - a framework built with the Lambda Architecture paradigm. It should have updates with fold-in of new users/items.

This is part of my answer to a similar question where both problems, near-online training and handling new users/items, were mixed.

Culley answered 21/4, 2016 at 11:6 Comment(0)
