When we do a k-fold Cross Validation we are testing how well a model behaves when it comes to predicting data it has never seen.
If I split my dataset into 90% training and 10% test and analyse the model's performance, there is no guarantee that my test set doesn't contain only the 10% "easiest" or "hardest" points to predict.
By doing a 10-fold cross validation I can be sure that every point is used exactly once for validation (and nine times for training). Since (in this case) the model is tested 10 times, we can analyse those 10 sets of test metrics, which gives a better picture of how the model performs when classifying new data.
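As a toy illustration of the partitioning idea (plain Python, nothing Spark-specific, with made-up row indices):

# Toy sketch of 10-fold partitioning: each point is validated exactly once
k = 10
points = list(range(50))                          # pretend these are row indices
val_folds = [points[i::k] for i in range(k)]      # 10 disjoint validation sets

for validation in val_folds:
    train = [p for p in points if p not in validation]
    # here you would fit on `train` and evaluate on `validation`

# sanity check: the validation folds cover every point exactly once
assert sorted(p for fold in val_folds for p in fold) == points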
The Spark documentation refers to Cross Validation as a way to optimize an algorithm's hyperparameters, when the purpose should be model checking.
By doing this:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

lr = LogisticRegression(maxIter=10, tol=1E-4)
ovr = OneVsRest(classifier=lr)
pipeline = Pipeline(stages=[..., ovr])

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=10)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(df)
I'm able to obtain (in my understanding) a model with the best set of parameters defined in paramGrid. I understand the value of this hyperparameter tuning, but what I want is to analyse a model's performance, not just to get the best model.
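For context, as far as I can tell the fitted CrossValidatorModel only exposes the model refit with the winning parameters and the metrics averaged over the folds (one value per ParamMap), not the per-fold results:

# What cvModel exposes out of the box (as far as I can tell):
print(cvModel.bestModel)   # the pipeline refit on the whole dataset with the best params
print(cvModel.avgMetrics)  # one value per ParamMap, averaged over the 10 folds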
The question is (for a 10-fold cross validation in this case):
Is it possible to use CrossValidator to extract metrics (F1, precision, recall, etc.) for each one of the 10 tests, or an average of those 10 tests for each metric? In other words, is it possible to use CrossValidator for model checking instead of model selection?
Thanks!
Update
As user10465355 stated in the comments, a similar question can be found here. The first suggestion there is to set collectSubModels to True before fitting, but that threw an error saying the keyword didn't exist (honestly, I didn't spend a lot of time trying to figure out why).
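For what it's worth, on a newer Spark (I believe PySpark 2.4+) that suggestion would look roughly like the sketch below; I couldn't test it on my version, so treat it as an assumption about the newer API:

# Sketch only: assumes PySpark >= 2.4, where CrossValidator accepts collectSubModels
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=10,
                          collectSubModels=True)
cvModel = crossval.fit(df)
# cvModel.subModels should then hold one fitted model per fold and per ParamMap,
# which you could evaluate fold by fold yourself
subModels = cvModel.subModels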
The user Mack provides a workaround in his answer that prints out the intermediate training results. With the method he provided it is possible to print the intermediate values of the evaluation metric. Since I want to extract the intermediate precision, recall, F1 and confusion matrix as well, I made some changes to his method:
import collections

import numpy as np

from pyspark.ml.tuning import CrossValidator, CrossValidatorModel
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.functions import rand

# Holds, for one fold and one ParamMap, the params used and the MulticlassMetrics
# computed on that fold's validation set.
TestResult = collections.namedtuple("TestResult", ["params", "metrics"])


class CrossValidatorVerbose(CrossValidator):

    def _fit(self, dataset):
        folds = []

        est = self.getOrDefault(self.estimator)
        epm = self.getOrDefault(self.estimatorParamMaps)
        numModels = len(epm)

        eva = self.getOrDefault(self.evaluator)
        metricName = eva.getMetricName()

        nFolds = self.getOrDefault(self.numFolds)
        seed = self.getOrDefault(self.seed)
        h = 1.0 / nFolds

        # Assign each row a random number used to carve out the validation folds
        randCol = self.uid + "_rand"
        df = dataset.select("*", rand(seed).alias(randCol))
        metrics = [0.0] * numModels

        for i in range(nFolds):
            folds.append([])
            foldNum = i + 1
            print("Comparing models on fold %d" % foldNum)

            validateLB = i * h
            validateUB = (i + 1) * h
            condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
            validation = df.filter(condition)
            train = df.filter(~condition)

            for j in range(numModels):
                paramMap = epm[j]
                model = est.fit(train, paramMap)
                # TODO: duplicate evaluator to take extra params from input
                prediction = model.transform(validation, paramMap)
                metric = eva.evaluate(prediction)
                metrics[j] += metric

                avgSoFar = metrics[j] / foldNum
                print("params: %s\t%s: %f\tavg: %f" % (
                    {param.name: val for (param, val) in paramMap.items()},
                    metricName, metric, avgSoFar))

                # Keep the full multiclass metrics for this fold / ParamMap
                predictionLabels = prediction.select("prediction", "label")
                allMetrics = MulticlassMetrics(predictionLabels.rdd)
                folds[i].append(TestResult(paramMap.items(), allMetrics))

        if eva.isLargerBetter():
            bestIndex = np.argmax(metrics)
        else:
            bestIndex = np.argmin(metrics)

        bestParams = epm[bestIndex]
        bestModel = est.fit(dataset, bestParams)
        avgMetrics = [m / nFolds for m in metrics]
        bestAvg = avgMetrics[bestIndex]
        print("Best model:\nparams: %s\t%s: %f" % (
            {param.name: val for (param, val) in bestParams.items()},
            metricName, bestAvg))

        # Return the per-fold results alongside the usual CrossValidatorModel
        return self._copyValues(CrossValidatorModel(bestModel, avgMetrics)), folds
To use it, just replace CrossValidator with CrossValidatorVerbose and, when fitting the model, do:
cvModel, folds = crossval.fit(df)
To print the metrics of a specific fold (1st fold with the 1st set of hyperparameters):
def printMetrics(metrics, df):
    labels = df.rdd.map(lambda lp: lp.label).distinct().collect()
    for label in sorted(labels):
        print("Class %s precision = %s" % (label, metrics.precision(label)))
        print("Class %s recall = %s" % (label, metrics.recall(label)))
        print("Class %s F1 Measure = %s" % (label, metrics.fMeasure(label, beta=1.0)))
        print("")

    # Weighted stats
    print("Weighted recall = %s" % metrics.weightedRecall)
    print("Weighted precision = %s" % metrics.weightedPrecision)
    print("Weighted F(1) Score = %s" % metrics.weightedFMeasure())
    print("Weighted F(0.5) Score = %s" % metrics.weightedFMeasure(beta=0.5))
    print("Weighted false positive rate = %s" % metrics.weightedFalsePositiveRate)
    print("Accuracy = %s" % metrics.accuracy)

printMetrics(folds[0][0].metrics, df)
This will print something like:
Class 0.0 precision = 0.809523809524
Class 0.0 recall = 0.772727272727
Class 0.0 F1 Measure = 0.790697674419
Class 1.0 precision = 0.857142857143
Class 1.0 recall = 0.818181818182
Class 1.0 F1 Measure = 0.837209302326
Class 2.0 precision = 0.875
Class 2.0 recall = 0.875
Class 2.0 F1 Measure = 0.875
...
Weighted recall = 0.808333333333
Weighted precision = 0.812411616162
Weighted F(1) Score = 0.808461689698
Weighted F(0.5) Score = 0.810428077222
Weighted false positive rate = 0.026335560185
Accuracy = 0.808333333333
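And if what you want is the average of a metric across the 10 folds (here for the first set of hyperparameters), you can aggregate the stored MulticlassMetrics objects yourself, for example:

# Average a few metrics over the 10 folds for the first ParamMap (index 0)
numFolds = len(folds)
avgWeightedPrecision = sum(f[0].metrics.weightedPrecision for f in folds) / numFolds
avgWeightedRecall = sum(f[0].metrics.weightedRecall for f in folds) / numFolds
avgWeightedF1 = sum(f[0].metrics.weightedFMeasure() for f in folds) / numFolds
avgAccuracy = sum(f[0].metrics.accuracy for f in folds) / numFolds

print("Avg weighted precision = %s" % avgWeightedPrecision)
print("Avg weighted recall = %s" % avgWeightedRecall)
print("Avg weighted F1 = %s" % avgWeightedF1)
print("Avg accuracy = %s" % avgAccuracy)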