What is the right way to save\load models in Spark\PySpark
Asked Answered
F

4

16

I'm working with Spark 1.3.0 using PySpark and MLlib and I need to save and load my models. I use code like this (taken from the official documentation )

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
predictions.collect() # shows me some predictions
model.save(sc, "model0")

# Trying to load saved model and work with it
model0 = MatrixFactorizationModel.load(sc, "model0")
predictions0 = model0.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))

After I try to use model0 I get a long traceback, which ends with this:

Py4JError: An error occurred while calling o70.predict. Trace:
py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

So my question is - am I doing something wrong? As far as I debugged my models are stored (locally and on HDFS) and they contain many files with some data. I have a feeling that models are saved correctly but probably they aren't loaded correctly. I also googled around but found nothing related.

Looks like this save\load feature has been added recently in Spark 1.3.0 and because of this I have another question - what was the recommended way to save\load models before the release 1.3.0? I haven't found any nice ways to do this, at least for Python. I also tried Pickle, but faced with the same issues as described here Save Apache Spark mllib model in python

Finegan answered 25/3, 2015 at 12:3 Comment(0)
M
7

One way to save a model (in Scala; but probably is similar in Python):

// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("linReg.model")

Saved model can then be loaded as:

val linRegModel = sc.objectFile[LinearRegressionModel]("linReg.model").first()

See also related question

For more details see (ref)

Medallion answered 19/9, 2015 at 4:17 Comment(0)
C
5

As of this pull request merged on Mar 28, 2015 (a day after your question was last edited) this issue has been resolved.

You just need to clone/fetch the latest version from GitHub (git clone git://github.com/apache/spark.git -b branch-1.3) then build it (following the instructions in spark/README.md) with $ mvn -DskipTests clean package.

Note: I ran into trouble building Spark because Maven was being wonky. I resolved that issue by using $ update-alternatives --config mvn and selecting the 'path' that had Priority: 150, whatever that means. Explanation here.

Canberra answered 31/3, 2015 at 13:26 Comment(1)
Yes, I've seen this PR, thanks! But I haven't tried to build Spark myself yet. Also thanks for a tip for Maven :)Finegan
S
2

I run into this also -- it looks like a bug. I have reported to spark jira.

Spermogonium answered 27/3, 2015 at 13:52 Comment(0)
R
2

Use pipeline in ML to train the model, and then use MLWriter and MLReader to save models and read them back.

from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel

pipeTrain.write().overwrite().save(outpath)
model_in = PipelineModel.load(outpath)
Rochellerochemont answered 13/10, 2017 at 17:1 Comment(1)
thanks, but this question is very old :) many things have changed since the time it was asked.Finegan

© 2022 - 2024 — McMap. All rights reserved.