I'm building a Random Forest model with Spark ML and want to save it so I can reuse it later. I'm running this in pyspark (Spark 2.0.1) without HDFS, so the files are saved to the local file system.
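(The sqlContext below is the pre-created one from the pyspark shell; if this were run as a standalone script, the setup would be roughly the following, with the master URL and app name just placeholders:)

from pyspark.sql import SparkSession, SQLContext

# Rough standalone equivalent of the shell's built-in context (placeholder names)
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('rf-save-load') \
    .getOrCreate()
sqlContext = SQLContext(spark.sparkContext)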
I've tried to do it like so:
import pyspark.sql.types as T
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Toy XOR-style data: two integer feature columns and a double label
data = [[0, 0, 0.],
        [0, 1, 1.],
        [1, 0, 1.],
        [1, 1, 0.]]
schema = T.StructType([
    T.StructField('a', T.IntegerType(), True),
    T.StructField('b', T.IntegerType(), True),
    T.StructField('label', T.DoubleType(), True)])
df = sqlContext.createDataFrame(data, schema)

# Combine the two feature columns into a single 'features' vector column
assembler = VectorAssembler(inputCols=['a', 'b'], outputCol='features')
df = assembler.transform(df)

# Train the forest and save it locally, overwriting any previous save
classifier = RandomForestClassifier(numTrees=10, maxDepth=15, labelCol='label', featuresCol='features')
model = classifier.fit(df)
model.write().overwrite().save('saved_model')
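The save itself doesn't raise anything; only the load below fails. To see what save() actually wrote locally I just walk the directory with plain Python (the comment about the expected layout is my assumption about the Spark ML writer, not something I've confirmed):

import os

# List everything under the saved_model directory.
# I'd expect a 'metadata' subdirectory (JSON) plus 'data' / 'treesMetadata'
# subdirectories (Parquet), but that's an assumption on my part.
for root, dirs, files in os.walk('saved_model'):
    for name in files:
        print(os.path.join(root, name))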
And then, to load the model:
from pyspark.ml.classification import RandomForestClassificationModel
loaded_model = RandomForestClassificationModel.load('saved_model')
But I get this error:
Py4JJavaError: An error occurred while calling o108.load.
: java.lang.UnsupportedOperationException: empty collection
I'm not sure which collection it's referring to. Any ideas on how to properly load (or save) the model?