I'm wondering what the best way is to evaluate a fitted binary classification model using Apache Spark 2.4.5 and PySpark (Python). I want to consider different metrics such as accuracy, precision, recall, AUC and F1 score.
Let us assume that the following is given:
# pyspark.sql.dataframe.DataFrame in VectorAssembler format containing two columns: target and features
# DataFrame we want to evaluate
df
# Fitted pyspark.ml.tuning.TrainValidationSplitModel (any arbitrary ml algorithm)
model
Option 1
Neither BinaryClassificationEvaluator nor MulticlassClassificationEvaluator can calculate all of the metrics mentioned above on its own, so we use both evaluators.
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
# Create both evaluators
evaluatorMulti = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction")
evaluator = BinaryClassificationEvaluator(labelCol="target", rawPredictionCol="prediction", metricName='areaUnderROC')
# Make predictions
predictionAndTarget = model.transform(df).select("target", "prediction")
# Get metrics
acc = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "accuracy"})
f1 = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "f1"})
weightedPrecision = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "weightedPrecision"})
weightedRecall = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "weightedRecall"})
auc = evaluator.evaluate(predictionAndTarget)
Downsides
- It seems weird and contradictory to use MulticlassClassificationEvaluator when evaluating a binary classifier
- I have to use two different evaluators to calculate five metrics
- MulticlassClassificationEvaluator only calculates weightedPrecision and weightedRecall (which is fine for multiclass classification). However, are these two metrics equal to precision and recall in the binary case?
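On that last point, a quick hand-worked sketch (pure Python with made-up toy data, no Spark needed) suggests the answer is generally no: weightedPrecision and weightedRecall average over both classes, while binary precision and recall look only at the positive class. As a side effect of the definitions, weightedRecall always equals accuracy.

```python
# Toy labels and hard predictions in {0, 1} (made up for illustration)
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0]

def per_class(c):
    """Precision and recall treating class c as the positive class."""
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn)

# Binary precision/recall: positive class only
prec1, rec1 = per_class(1)                                            # 0.5, 0.5

# Weighted averages over both classes, which is what
# MulticlassClassificationEvaluator reports
weight = {c: sum(t == c for t in y_true) / len(y_true) for c in (0, 1)}
weightedPrecision = sum(weight[c] * per_class(c)[0] for c in (0, 1))  # 0.6
weightedRecall = sum(weight[c] * per_class(c)[1] for c in (0, 1))     # 0.6
```

Here binary precision/recall are both 0.5, while the weighted variants are 0.6 (equal to accuracy, 3 correct out of 5), so the two pairs of metrics do not coincide in general.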
Option 2
Use the RDD-based API with BinaryClassificationMetrics and MulticlassMetrics. Again, neither class can calculate all of the metrics mentioned above on its own (at least not in Python), so we use both.
from pyspark.mllib.evaluation import BinaryClassificationMetrics, MulticlassMetrics
# Make predictions; note that the RDD-based metric classes expect
# (prediction/score, label) pairs, so prediction must come first
predictionAndTarget = model.transform(df).select("prediction", "target")
# Create both evaluators
metrics_binary = BinaryClassificationMetrics(predictionAndTarget.rdd.map(tuple))
metrics_multi = MulticlassMetrics(predictionAndTarget.rdd.map(tuple))
acc = metrics_multi.accuracy
f1 = metrics_multi.fMeasure(1.0)
precision = metrics_multi.precision(1.0)
recall = metrics_multi.recall(1.0)
auc = metrics_binary.areaUnderROC
Downsides
- According to Spark, the RDD-based API is now in maintenance mode and the DataFrame-based API is the primary API
- Again, I have to use two different metric classes to calculate five metrics
- Again, using MulticlassMetrics seems contradictory when evaluating a binary classifier
Upside
- In my case (~1,000,000 rows), Option 2 seems to be faster than Option 1
Surprise
- In my case I get different f1 and areaUnderROC values when using Option 1 vs. Option 2
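One plausible contributor to the AUC discrepancy (an assumption on my part, not verified against the data above): in Option 1 the hard 0/1 prediction column is passed as rawPredictionCol, so the evaluator sees a single operating point instead of the full score-ranked ROC curve. Similarly, the "f1" of MulticlassClassificationEvaluator is a weighted average over both classes, whereas fMeasure(1.0) is the F1 of class 1 alone, so those are different quantities. A tiny pure-Python sketch with made-up scores shows how thresholding alone can change AUC:

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    return sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

labels = [0, 1, 0, 1]              # toy labels
scores = [0.2, 0.9, 0.6, 0.7]      # hypothetical model probabilities
hard = [1 if s >= 0.5 else 0 for s in scores]

auc_scores = auc(labels, scores)   # 1.0  (the scores rank the positives perfectly)
auc_hard = auc(labels, hard)       # 0.75 (thresholding discards the ranking)
```

Feeding an evaluator real scores (e.g. the rawPrediction or probability column) versus thresholded predictions can therefore yield different AUC values on the same data.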
Option 3
Use numpy and sklearn
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, f1_score
# Make predictions and collect them to the driver
predictionAndTarget = model.transform(df).select("target", "prediction")
predictionAndTargetNumpy = np.array(predictionAndTarget.collect())
acc = accuracy_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
f1 = f1_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
precision = precision_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
recall = recall_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
auc = roc_auc_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
Downsides
- Using sklearn and numpy seems odd, as Apache Spark claims to have its own API for evaluation
- Collecting everything to the driver for numpy and sklearn can become infeasible as the dataset grows
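For completeness, a further sketch (my own addition, not one of the original options): all threshold-based metrics can be derived from a single distributed aggregation, so only the four confusion-matrix counts reach the driver rather than the full prediction set. The derivation from the counts is plain Python:

```python
# Sketch: in Spark, one aggregation yields the confusion-matrix counts, e.g.
#   rows = model.transform(df).groupBy("target", "prediction").count().collect()
#   cm = {(int(r["target"]), int(r["prediction"])): r["count"] for r in rows}
# The toy counts below stand in for that collect() result.
cm = {(1, 1): 40, (0, 1): 10, (1, 0): 20, (0, 0): 30}

tp, fp = cm.get((1, 1), 0), cm.get((0, 1), 0)
fn, tn = cm.get((1, 0), 0), cm.get((0, 0), 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)       # 0.7
precision = tp / (tp + fp) if tp + fp else 0.0   # 0.8
recall = tp / (tp + fn) if tp + fn else 0.0      # 2/3
f1 = (2 * precision * recall / (precision + recall)
      if precision + recall else 0.0)
```

AUC cannot be recovered from these counts, since it needs the ranked scores, so BinaryClassificationEvaluator (fed the rawPrediction column) would still be needed for that one metric.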
To summarize my questions:
- Which option above (if any) is recommended for evaluating a binary classifier using Apache Spark 2.4.5 and PySpark?
- Are there other options? Am I missing something important?
- Why do I get different results for the metrics when using Option 1 vs. Option 2?
Comment: Regarding MulticlassClassificationEvaluator, I can only second the OP's comments above. It is confusing that not all metrics are provided by one evaluator class. It would be great to have something like sklearn's classification_report in Spark, too. – Helmand