Dealing with unbalanced datasets in Spark MLlib
I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dealing with unbalanced datasets (such as SMOTE) in classification problems using Spark's MLlib.

I'm using MLlib's Random Forest implementation and have already tried the simplest approach of randomly undersampling the larger class, but it didn't work as well as I expected.
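
For reference, the undersampling I tried looks roughly like the sketch below (the "label" column name is illustrative; my real schema differs):

import org.apache.spark.sql.DataFrame

// Sample the majority (negative) class down to roughly the size of the
// minority (positive) class, then recombine
def undersample(df: DataFrame): DataFrame = {
  val positives = df.filter(df("label") === 1.0)
  val negatives = df.filter(df("label") === 0.0)
  val ratio = positives.count.toDouble / negatives.count
  positives.union(negatives.sample(withReplacement = false, fraction = ratio, seed = 42L))
}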

I would appreciate any feedback regarding your experience with similar issues.

Thanks,

Butyraceous answered 27/10, 2015 at 16:4 Comment(6)
The SMOTEBoost algorithm suggests training the dataset with a weak learner. Why don't you implement something like that: issues.apache.org/jira/browse/SPARK-1546 – Lymphoma
@eliasah, what I meant is that my dataset contains very few positive examples compared to the negative ones (about 1 in every 100). The trained classifier is biased towards the majority (negative) class, having higher predictive accuracy over this class but poorer predictive accuracy over the minority class. The "didn't work as expected" meant that the precision of the classifier is about 60-70% (i.e. 60-70% of the positive cases are classified correctly) when doing 10-fold cross-validation testing. – Butyraceous
This means you'll need to tune your model parameters, or maybe RF is not a good fit for your data. Did you perform a grid search to find your parameter configuration? – Enrobe
How connected and dense is your positive class? Are the features discrete or continuous? RF works well on discrete data that is locally connected. If the points are globally connected (one big clump), then you might consider SVM, spectral clustering, or even k-means. – Wither
@Enrobe "Binary classification isn't affected by unbalanced data". Do you have any reference for this claim? I am not saying it's not true, but it is not intuitive, at least for me. – Hemorrhoidectomy
"Binary classification isn't affected by unbalanced data" - this is absolutely not true.Kerb
Class weight with Spark ML

As of this writing, class weighting for the Random Forest algorithm is still under development (see here).

But if you're willing to try other classifiers, this functionality has already been added to Logistic Regression.

Consider a case where we have 80% positives (label == 1) in the dataset, so theoretically we want to "under-sample" the positive class. The logistic loss objective function should treat the negative class (label == 0) with higher weight: each negative record then gets weight 0.8, and each positive record gets weight 0.2.

Here is an example in Scala of generating this weight; we add a new column to the DataFrame holding the weight for each record in the dataset:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

def balanceDataset(dataset: DataFrame): DataFrame = {
  // Re-balancing (weighting) of records to be used in the logistic loss objective function
  val numNegatives = dataset.filter(dataset("label") === 0).count
  val datasetSize = dataset.count
  val balancingRatio = (datasetSize - numNegatives).toDouble / datasetSize

  // The minority class gets the larger weight; with 80% positives,
  // negatives are weighted 0.8 and positives 0.2
  val calculateWeights = udf { d: Double =>
    if (d == 0.0) balancingRatio
    else 1.0 - balancingRatio
  }

  dataset.withColumn("classWeightCol", calculateWeights(dataset("label")))
}

Then, we create the classifier as follows:

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setWeightCol("classWeightCol")
  .setLabelCol("label")
  .setFeaturesCol("features")
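
Putting it together, a minimal usage sketch (trainingData, testData, and the column names are assumptions, not part of the API):

// Hypothetical end-to-end usage: weight the records, then fit
val weighted = balanceDataset(trainingData)
val model = lr.fit(weighted)

// Predictions on new data come out of transform() as usual
val predictions = model.transform(testData)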

For more details, see: https://issues.apache.org/jira/browse/SPARK-9610

Predictive Power

A different issue you should check is whether your features have "predictive power" for the label you're trying to predict. If, after under-sampling, you still have low precision, maybe that has nothing to do with the fact that your dataset is imbalanced by nature.


I would do an exploratory data analysis - if the classifier doesn't do better than a random choice, there is a risk that there simply is no connection between the features and the class.

  • Perform correlation analysis for every feature with the label (a quick sketch follows this list).
  • Generate class-specific histograms for the features (i.e. plot histograms of the data for each class, for a given feature, on the same axis); this can be a good way to show whether a feature discriminates well between the two classes.
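
As a quick sketch of the correlation point above (the feature column names are assumptions), Spark's DataFrameStatFunctions can compute a Pearson correlation per feature:

// Hypothetical feature columns; replace with your own
val featureCols = Seq("feature1", "feature2", "feature3")

// Pearson correlation of each feature with the (numeric) label
featureCols.foreach { c =>
  println(s"corr($c, label) = " + dataset.stat.corr(c, "label"))
}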

Overfitting - a low error on your training set together with a high error on your test set might be an indication that you overfit with an overly flexible feature set.


Bias vs. variance - check whether your classifier suffers from a high-bias or a high-variance problem.

  • Training error vs. validation error - graph the validation error and the training error as a function of the number of training examples (do incremental learning; a sketch follows this list).
    • If the lines seem to converge to the same value and are close at the end, then your classifier has high bias. In that case, adding more data won't help; change the classifier for one that has higher variance, or simply lower the regularization parameter of your current one.
    • If, on the other hand, the lines are quite far apart and you have a low training error but a high validation error, then your classifier has high variance. In this case, getting more data is very likely to help. If the variance is still too high after getting more data, you can increase the regularization parameter.
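
A rough sketch of such a learning curve in Spark (the training/validation DataFrames, the lr estimator from above, and AUC as the metric are all assumptions):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Area under ROC, computed from the model's rawPrediction column
val evaluator = new BinaryClassificationEvaluator().setLabelCol("label")

Seq(0.1, 0.25, 0.5, 0.75, 1.0).foreach { fraction =>
  // Train on a growing subset and compare train vs. validation performance
  val subset = training.sample(withReplacement = false, fraction, seed = 7L)
  val model = lr.fit(subset)
  val trainAuc = evaluator.evaluate(model.transform(subset))
  val validAuc = evaluator.evaluate(model.transform(validation))
  println(f"fraction=$fraction%.2f trainAUC=$trainAuc%.3f validAUC=$validAuc%.3f")
}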
Kerb answered 15/8, 2016 at 8:17 Comment(7)
Thanks for the pointers @Serendipity. I wasn't aware that Logistic Regression in Spark ML supported class weights. – Butyraceous
@Butyraceous do you need an example of the implementation? I've just tried it out. – Kerb
Thanks @Serendipity! One thing I'm noticing is that when the classifier is trained over a weighted dataset, the output probabilities (I need actual probabilities, not the predicted label) are not well calibrated. This means that the resulting probabilities don't match the original dataset's distribution but are adjusted to the weighted dataset. This, in turn, causes a higher log-loss over the validation set than when manually undersampling the original training set and manually calibrating the classifier's output probabilities. – Butyraceous
This was of great help, thanks. The fact that this isn't documented anywhere, that there are no examples, and that you had to reference the GitHub PR and the JIRA tasks is blowing my mind. Such a great feature is present in the ml library, and the only way to find out about it is by digging through GitHub PRs, the Spark source code, and JIRAs. Spark has the worst documentation by far, and that's too bad. – Southerland
@Butyraceous would you like to elaborate that comment of yours into an answer, showing how you manually under-sampled and calibrated the classifier? – Southerland
@Kerb Thank you so much for this answer. I am trying to do the same but using Python, not Scala, and I couldn't access the class column. Here is my code: def calculateWeights(d): if d == 0: return 1 * (dataset.count() - dataset.filter(col("kategorie1") == 0).count()) / dataset.count() else: return 1 - ((dataset.count() - dataset.filter(col("kategorie1") == 0).count()) / dataset.count()) weightedDataset = dataset.withColumn("classWeightCol", calculateWeights(col("kategorie1"))) – Yeta
Hi @EmnaJaoua, do you mean the function withColumn()? What error are you getting? It is a bit hard to understand the code from your comment. Can you please open a new question with the error and the code for reproducing your problem (maybe upload a sample of your data as well, if you can), and then reference this question? – Kerb
I used the solution by @Serendipity, but we can optimize the balanceDataset function to avoid using a udf. I also added the ability to change the label column being used. This is the version of the function I ended up with:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, sum, when}

def balanceDataset(dataset: DataFrame, label: String = "label"): DataFrame = {
  val spark = dataset.sparkSession
  import spark.implicits._

  // Re-balancing (weighting) of records to be used in the logistic loss objective function
  val (datasetSize, positives) =
    dataset.select(count("*"), sum(dataset(label))).as[(Long, Double)].collect.head
  val balancingRatio = positives / datasetSize

  // Same weighting as the udf version, expressed with built-in column functions
  dataset.withColumn("classWeightCol",
    when(dataset(label) === 0.0, balancingRatio).otherwise(1.0 - balancingRatio))
}

We create the classifier as he stated, with:

new LogisticRegression().setWeightCol("classWeightCol").setLabelCol("label").setFeaturesCol("features")
Nuris answered 7/4, 2017 at 13:2 Comment(0)
@dbakr Did you get an answer for your biased predictions on your imbalanced dataset?

Though I'm not sure it was your original plan, note that if you first subsample the majority class of your dataset by a ratio r, then, in order to get unbiased predictions from Spark's logistic regression, you can either:

  • use the rawPrediction provided by the transform() function and adjust the intercept with log(r), or
  • train your regression with weights using .setWeightCol("classWeightCol") (see the article cited here to figure out the value that must be set in the weights).

A sketch of the first option follows.
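Here is a minimal sketch of the log(r) intercept correction (the model, the test DataFrame, and the value of r are assumptions for illustration):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

val r = 0.1 // hypothetical: fraction of the majority (negative) class kept when subsampling

// For binary logistic regression, rawPrediction(1) is the log-odds of the positive
// class; shifting it by log(r) undoes the bias introduced by the subsampling,
// and the sigmoid maps the corrected log-odds back to a probability
val calibrate = udf { raw: Vector =>
  val logOdds = raw(1) + math.log(r)
  1.0 / (1.0 + math.exp(-logOdds))
}

val calibrated = model.transform(test)
  .withColumn("calibratedProbability", calibrate(col("rawPrediction")))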

Homespun answered 22/8, 2017 at 17:1 Comment(0)