Spark ML Tuning, CrossValidator: access the metrics

In order to build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline:

val cv = new CrossValidator()
        .setEstimator(pipeline)
        .setEstimatorParamMaps(paramGrid)
        .setEvaluator(new MulticlassClassificationEvaluator)
        .setNumFolds(10)

val cvModel = cv.fit(trainingSet)

The pipeline contains the usual transformers and estimators, in the following order: Tokenizer, StopWordsRemover, HashingTF, IDF and finally NaiveBayes, roughly as sketched below.
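
For reference, a minimal sketch of that pipeline (the column names and the tokens/rawFeatures/features wiring here are illustrative assumptions, not my exact code):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{Tokenizer, StopWordsRemover, HashingTF, IDF}

// Assumed column names; the real pipeline may wire the stages differently
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("tokens")
val remover   = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
val hashingTF = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val nb        = new NaiveBayes()  // uses the default "features" and "label" columns

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, remover, hashingTF, idf, nb))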

Is it possible to access the metrics calculated for the best model?

Ideally, I would like to access the metrics of all models to see how changing the parameters affects the quality of the classification. But for the moment, the best model is good enough.

FYI, I am using Spark 1.6.0

Underproof answered 8/1, 2016 at 13:59

Here's how I do it:

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, stopWordsFilter, tf, idf, word2Vec, featureVectorAssembler, categoryIndexerModel, classifier, categoryReverseIndexer))

...

val paramGrid = new ParamGridBuilder()
  .addGrid(tf.numFeatures, Array(10, 100))
  .addGrid(idf.minDocFreq, Array(1, 10))
  .addGrid(word2Vec.vectorSize, Array(200, 300))
  .addGrid(classifier.maxDepth, Array(3, 5))
  .build()

paramGrid.size // 16 entries (2 × 2 × 2 × 2 parameter combinations)

...

// Average cross-validation metric per ParamGrid entry (one value per parameter combination)
val avgMetricsParamGrid = crossValidatorModel.avgMetrics

// Combine with paramGrid to see how they affect the overall metrics
val combined = paramGrid.zip(avgMetricsParamGrid)
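
If you want to eyeball which settings help most, you can then print the pairs sorted by metric. Just a sketch on top of the combined value above (it assumes a metric where higher is better, e.g. accuracy or F1):

// Sort by average metric, best first, and print each parameter combination
combined
  .sortBy { case (_, metric) => -metric }
  .foreach { case (params, metric) => println(f"$metric%.4f  $params") }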

...

val bestModel = crossValidatorModel.bestModel.asInstanceOf[PipelineModel]

// Explain params for each stage (explainParams returns the full parameter listing as a string)
val bestHashingTFNumFeatures = bestModel.stages(2).asInstanceOf[HashingTF].explainParams
val bestIDFMinDocFrequency = bestModel.stages(3).asInstanceOf[IDFModel].explainParams
val bestWord2VecVectorSize = bestModel.stages(4).asInstanceOf[Word2VecModel].explainParams
val bestDecisionTreeDepth = bestModel.stages(7).asInstanceOf[DecisionTreeClassificationModel].explainParams
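
If you only need the winning values rather than the full explainParams string, the per-stage getters also work. A sketch, using the same assumed stage indices as above:

// Winning parameter values from the best PipelineModel
val bestNumFeatures = bestModel.stages(2).asInstanceOf[HashingTF].getNumFeatures
val bestMinDocFreq  = bestModel.stages(3).asInstanceOf[IDFModel].getMinDocFreq
val bestVectorSize  = bestModel.stages(4).asInstanceOf[Word2VecModel].getVectorSize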
Daltondaltonism answered 8/1, 2016 at 21:48 Comment(1)
zip works, but I really don't like it because it assumes internal knowledge about how the CrossValidator works. They could change how the metrics array gets built so it's in a different order in the next version, and you are hosed but don't know you're hosed because your code still works. I'd like to have the params for a model returned with its metric. I'd also like to see summary stats instead of just the mean. How useful is a mean without a standard deviation? – Barbee
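
One way to avoid relying on the order of your local paramGrid is to zip against the param maps the fitted model itself reports; avgMetrics is documented to be in the same order as the estimator param maps. A minimal sketch:

// Pair each param map stored in the CrossValidatorModel with its average metric
crossValidatorModel.getEstimatorParamMaps
  .zip(crossValidatorModel.avgMetrics)
  .foreach { case (params, metric) => println(s"$metric for $params") }

That still only gives the mean across folds, though; as far as I know, per-fold metrics (and hence a standard deviation) are not exposed by CrossValidatorModel in these Spark versions.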
cvModel.avgMetrics

This works in PySpark 2.2.0 as well.

Danforth answered 9/11, 2017 at 21:24
