"empty collection" error when trying to load a saved Spark model using pyspark
I'm building a Random Forest model using Spark and I want to save it to use again later. I'm running this on pyspark (Spark 2.0.1) without HDFS, so the files are saved to the local file system.

I've tried to do it like so:

import pyspark.sql.types as T
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

data = [[0, 0, 0.],
        [0, 1, 1.],
        [1, 0, 1.],
        [1, 1, 0.]]

schema = T.StructType([
    T.StructField('a', T.IntegerType(), True),
    T.StructField('b', T.IntegerType(), True),
    T.StructField('label', T.DoubleType(), True)])

# sqlContext is predefined in the pyspark shell; in a standalone script you
# would create a SparkSession and use spark.createDataFrame instead
df = sqlContext.createDataFrame(data, schema)

# assemble the two feature columns into a single vector column
assembler = VectorAssembler(inputCols=['a', 'b'], outputCol='features')
df = assembler.transform(df)

classifier = RandomForestClassifier(numTrees=10, maxDepth=15, labelCol='label', featuresCol='features')
model = classifier.fit(df)

model.write().overwrite().save('saved_model')

And then, to load the model:

from pyspark.ml.classification import RandomForestClassificationModel

loaded_model = RandomForestClassificationModel.load('saved_model')

But I get this error:

Py4JJavaError: An error occurred while calling o108.load.
: java.lang.UnsupportedOperationException: empty collection

I'm not sure which collection it's referring to. Any ideas on how to properly load (or save) the model?

Criss asked 26/1, 2017 at 19:15 Comment(2)
I've seen several questions that are similar to this one, but the error they're having is different from mine, e.g. #40327879 – Criss
Is this question still valid? I just checked on Spark 3.1.2 running without HDFS and the code you pasted worked out of the box. Maybe try upgrading your Spark version, as it seems this issue has been addressed already. – Quicksand

I got a similar issue on a Spark cluster with Jupyter Notebook installed on 4 different Docker containers. I fixed it by using the same persistent folder, writable by all of the containers, and saving the model there. So my suggestion is to make sure you are using the same persistent folder and that both Spark and your Python program can write to it.
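
For illustration, a minimal sketch of that idea, assuming every container mounts the same volume at /shared/models (the path is hypothetical, and the volume itself has to be set up outside Spark, e.g. with docker run -v):

from pyspark.ml.classification import RandomForestClassificationModel

# Hypothetical path on a volume mounted at the same location in every
# container, so the driver and all executors read and write the same files
shared_path = '/shared/models/saved_model'

model.write().overwrite().save(shared_path)
loaded_model = RandomForestClassificationModel.load(shared_path)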

Rehabilitation answered 20/1, 2020 at 11:39 Comment(1)
Could you please provide an example of how to set up a persistent folder? – Dissertation
