How to evaluate a classifier with PySpark 2.4.5
I'm wondering what the best way is to evaluate a fitted binary classification model using Apache Spark 2.4.5 and PySpark (Python). I want to consider different metrics such as accuracy, precision, recall, AUC and F1 score.

Let us assume that the following is given:

# pyspark.sql.dataframe.DataFrame in VectorAssembler format containing two columns: target and features
# DataFrame we want to evaluate
df

# Fitted pyspark.ml.tuning.TrainValidationSplitModel (wrapping an arbitrary ML algorithm)
model

Option 1

Neither BinaryClassificationEvaluator nor MulticlassClassificationEvaluator can calculate all of the metrics mentioned above on its own, so we use both evaluators.

from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

# Create both evaluators
evaluatorMulti = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction")
evaluator = BinaryClassificationEvaluator(labelCol="target", rawPredictionCol="prediction", metricName='areaUnderROC')

# Make predictions
predictionAndTarget = model.transform(df).select("target", "prediction")

# Get metrics
acc = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "accuracy"})
f1 = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "f1"})
weightedPrecision = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "weightedPrecision"})
weightedRecall = evaluatorMulti.evaluate(predictionAndTarget, {evaluatorMulti.metricName: "weightedRecall"})
auc = evaluator.evaluate(predictionAndTarget)
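
For reference, BinaryClassificationEvaluator can also be pointed at the model's raw score column instead of the thresholded 0/1 prediction column. A minimal sketch, assuming the fitted model emits the default rawPrediction column:

# AUC computed from raw scores; this generally differs from the AUC obtained by
# passing the hard 0/1 prediction column as rawPredictionCol, as done above
evaluatorRaw = BinaryClassificationEvaluator(labelCol="target", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
aucRaw = evaluatorRaw.evaluate(model.transform(df))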

Downsides

  • It seems weird and contradictory to use MulticlassClassificationEvaluator when evaluating a binary classifier
  • I have to use two different evaluators to calculate five metrics
  • MulticlassClassificationEvaluator only calculates weightedPrecision and weightedRecall (which is fine for multi-class classification). However, are these two metrics equal to precision and recall in the binary case?

Option 2

Use the RDD-based API with BinaryClassificationMetrics and MulticlassMetrics. Again, neither class can calculate all of the metrics mentioned above on its own (at least not in Python), so we use both.

from pyspark.mllib.evaluation import BinaryClassificationMetrics, MulticlassMetrics

# Make predictions
predictionAndTarget = model.transform(df).select("target", "prediction")

# Both mllib metrics classes expect (prediction, label) pairs, so swap the column order
predictionAndLabels = predictionAndTarget.rdd.map(lambda row: (float(row.prediction), float(row.target)))

# Create both metrics objects
metrics_binary = BinaryClassificationMetrics(predictionAndLabels)
metrics_multi = MulticlassMetrics(predictionAndLabels)

acc = metrics_multi.accuracy
f1 = metrics_multi.fMeasure(1.0)
precision = metrics_multi.precision(1.0)
recall = metrics_multi.recall(1.0)
auc = metrics_binary.areaUnderROC
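
MulticlassMetrics also exposes the confusion matrix directly, which makes the binary precision/recall formulas explicit. A minimal sketch; per the Spark docs, rows are actual labels and columns are predicted labels, in ascending label order:

cm = metrics_multi.confusionMatrix().toArray()
tn, fp = cm[0][0], cm[0][1]  # actual 0: predicted 0 / predicted 1
fn, tp = cm[1][0], cm[1][1]  # actual 1: predicted 0 / predicted 1
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)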

Downsides

  • According to Spark, the RDD-based API is now in maintenance mode and the DataFrame-based API is the primary API
  • Again, I have to use two different metric classes to calculate five metrics
  • Again, using MulticlassMetrics seems contradictory when evaluating a binary classifier

Upside

  • In my case (~1,000,000 rows) Option 2 seems to be faster than Option 1

Surprise

  • In my case I get different f1 and areaUnderROC values when using Option 1 vs. Option 2.

Option 3

Use numpy and sklearn

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, f1_score

# Make predictions
predictionAndTarget = model.transform(df).select("target", "prediction")

predictionAndTargetNumpy = np.array(predictionAndTarget.collect())

acc = accuracy_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
f1 = f1_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
precision = precision_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
recall = recall_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])
auc = roc_auc_score(predictionAndTargetNumpy[:,0], predictionAndTargetNumpy[:,1])

Downsides

  • Using sklearn and numpy seems odd, as Apache Spark claims to have its own API for evaluation
  • Using numpy and sklearn can even become impossible if the dataset grows too big, since .collect() pulls every prediction onto the driver (sampling can mitigate this; see the sketch below)
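
A minimal sketch of the sampling workaround, assuming approximate metrics are acceptable; the fraction and seed are illustrative values:

# Evaluate on a sample to keep the collected data small; the metrics become estimates
sampled = predictionAndTarget.sample(fraction=0.1, seed=42).toPandas()
acc_est = accuracy_score(sampled["target"], sampled["prediction"])
auc_est = roc_auc_score(sampled["target"], sampled["prediction"])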

To summarize my questions:

  1. Which option above (if any) is recommended for evaluating a binary classifier using Apache Spark 2.4.5 and PySpark?
  2. Are there other options? Am I missing something important?
  3. Why do I get different results for the metrics when using Option 1 vs. Option 2?
Insidious answered 20/3, 2020 at 10:23 Comment(2)
I tried your first method and it works only for accuracy. (Nikianikita)
Having tried to use Spark's MulticlassClassificationEvaluator, I can only second OP's comments above. It is confusing that not all metrics are provided by one evaluator class. It would be great to have something like sklearn's classification_report in Spark, too. (Helmand)

Not sure if it is still relevant, but I can answer your question 3, and thus maybe question 1 in turn.

Spark ML provides weighted precision and weighted recall metrics only as part of the MulticlassClassificationEvaluator module. If you are looking for an equivalent interpretation of the overall precision metric from the scikit-learn world, especially for binary classification, it is better to compute the confusion matrix and evaluate it using the precision and recall formulas, as sketched below.
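
A minimal sketch of that approach with the DataFrame API, reusing the target and prediction column names from the question:

preds = model.transform(df).select("target", "prediction")

# One pass over the data: count every (actual, predicted) combination
counts = {(int(r["target"]), int(r["prediction"])): r["count"]
          for r in preds.groupBy("target", "prediction").count().collect()}

tp = counts.get((1, 1), 0)
fp = counts.get((0, 1), 0)
fn = counts.get((1, 0), 0)
tn = counts.get((0, 0), 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)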

Weighted precision, as used by Spark ML, is computed from the precision of each class, which are then combined using the weight of each class label in the test set (L1 and L0 being the counts of actual label 1 and label 0), i.e.:

Prec(Label 1) = TP / (TP + FP)
Prec(Label 0) = TN / (TN + FN)
Weight of Label 1 in test set: WL1 = L1 / (L1 + L0)
Weight of Label 0 in test set: WL0 = L0 / (L1 + L0)
Weighted precision = Prec(Label 1) * WL1 + Prec(Label 0) * WL0

Weighted precision and recall will be higher than the overall precision and recall whenever there is even slight class imbalance in the dataset, and thus the sklearn-based and Spark-ML-based metrics will differ.

As an illustration, take the confusion matrix of a class-imbalanced dataset (rows = actual label, columns = predicted label, in ascending label order 0, 1):

 array([[3969025,  445123],
        [ 284283, 1663913]])

 Total label 1 count: 1948196
 Total label 0 count: 4414148

 Proportion of label 1: 0.306207272
 Proportion of label 0: 0.693792728


Spark ML gives the metrics:

Accuracy:           0.8853557745384405
Weighted precision: 0.8890015815237463
Weighted recall:    0.8853557745384406
F1 score:           0.8865644697253956

whereas the actual overall metrics computation gives (scikit-learn equivalent):

 Accuracy:  0.8853557745384405
 Precision: 0.7889448070113549
 Recall:    0.8540788503826103
 F1:        0.8202208 (= 2PR / (P + R))
 AUC:       0.8766194 (= (TPR + TNR) / 2 for hard 0/1 predictions)
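
A minimal numpy sketch reproducing both sets of numbers from the matrix above:

import numpy as np

cm = np.array([[3969025,  445123],
               [ 284283, 1663913]])  # rows = actual (0, 1), cols = predicted (0, 1)
tn, fp = cm[0]
fn, tp = cm[1]
total = cm.sum()

prec1 = tp / (tp + fp)              # 0.788944... (scikit-style precision)
prec0 = tn / (tn + fn)              # 0.933161...
w1 = (tp + fn) / total              # 0.306207... (proportion of label 1)
w0 = (tn + fp) / total              # 0.693792...

weighted_precision = prec1 * w1 + prec0 * w0  # ~0.88900, matches Spark ML
recall1 = tp / (tp + fn)                      # 0.854078... (scikit-style recall)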

Thus the Spark ML weighted versions inflate the overall metric computation we would otherwise observe, especially for binary classification.

Amalle answered 13/7, 2020 at 10:15 Comment(4)
Thanks for your explanation! However, my question was why I got different results when using Option 1 (Spark ML DataFrame-based API) vs. Option 2 (Spark MLlib RDD-based API). And I'm not sure how your explanation answers my first question (which option is recommended)? (Insidious)
Shouldn't "Total 1 Class labels" be 3969025 + 445123 = 4414148? (Cauthen)
I'm sorry, but if the weight for class 1 is basically L1 = (TP+FP)/(TP+FP+FN+TN) and I multiply it by TP/(TP+FP), and for class 0 L0 = (TN+FN)/(TP+FP+FN+TN) times TN/(TN+FN), then adding them together basically gives the accuracy (TP+TN)/(TP+TN+FP+FN). (Cauthen)
Good approach for pure evaluation, but how to use it as the evaluator for a CrossValidator object? I mean, how to run CrossValidator based on F1 (the scikit-learn equivalent)? (Hither)
