I'm using PySpark 2.0 for a Kaggle competition. I'd like to know how a model (RandomForestClassifier) behaves depending on different parameters. ParamGridBuilder() lets you specify several values for a single parameter, and then (I assume) performs a Cartesian product over the entire set of parameters. Assuming my DataFrame is already defined:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

rdc = RandomForestClassifier()
pipeline = Pipeline(stages=STAGES + [rdc])
paramGrid = (ParamGridBuilder()
             .addGrid(rdc.maxDepth, [3, 10, 20])
             .addGrid(rdc.minInfoGain, [0.01, 0.001])
             .addGrid(rdc.numTrees, [5, 10, 20, 30])
             .build())
evaluator = MulticlassClassificationEvaluator()
valid = TrainValidationSplit(estimator=pipeline,
                             estimatorParamMaps=paramGrid,
                             evaluator=evaluator,
                             trainRatio=0.50)
model = valid.fit(df)
result = model.bestModel.transform(df)
OK, so now I'm able to retrieve some simple information with a hand-made function:
def evaluate(result):
    predictionAndLabels = result.select("prediction", "label")
    metrics = ["f1", "weightedPrecision", "weightedRecall", "accuracy"]
    for m in metrics:
        evaluator = MulticlassClassificationEvaluator(metricName=m)
        print(str(m) + ": " + str(evaluator.evaluate(predictionAndLabels)))
Now I want several things:
- What are the parameters of the best model? This post partially answers the question: How to extract model hyper-parameters from spark.ml in PySpark?
- What are the parameters of all models?
- What are the results (i.e. recall, accuracy, etc.) of each model? I only found
print(model.validationMetrics)
which displays (it seems) a list containing the accuracy of each model, but I can't tell which model each entry refers to.
If I can retrieve all that information, I should be able to display graphs and bar charts, and work as I do with Pandas and scikit-learn.
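To pair each validation metric with the parameter combination that produced it, the general idea is to zip model.validationMetrics with the param maps from the grid (they are in the same order) and collect one row per model. A minimal sketch of that pairing logic, using plain lists and dicts as stand-ins for the Spark objects (all metric values and parameter combinations below are made up for illustration):

```python
# Stand-ins for model.validationMetrics and the list built by
# ParamGridBuilder — values here are invented for illustration.
validation_metrics = [0.71, 0.74, 0.69]
param_maps = [
    {"maxDepth": 3, "numTrees": 5},
    {"maxDepth": 3, "numTrees": 10},
    {"maxDepth": 10, "numTrees": 5},
]

# Pair each metric with its parameter combination: one row per model.
rows = [dict(params, metric=metric)
        for params, metric in zip(param_maps, validation_metrics)]

# The row with the highest metric corresponds to bestModel.
best = max(rows, key=lambda r: r["metric"])
```

With the rows in hand, pandas.DataFrame(rows) gives a table you can sort, filter, or plot, much like a scikit-learn grid-search result.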
zip(model.validationMetrics, model.getEstimatorParamMaps())
does not work with models. When I print model.params, it shows nothing (and I'm sure my paramGrid works). – Downstairs