sklearn logistic regression with unbalanced classes

I'm solving a classification problem with sklearn's logistic regression in python.

My problem is a generic one: I have a dataset with two classes (positive/negative, or 1/0), but the set is highly unbalanced. There are ~5% positives and ~95% negatives.

I know there are a number of ways to deal with an unbalanced problem like this, but I have not found a good explanation of how to implement them properly with the sklearn package.

What I've done thus far is to build a balanced training set by selecting all entries with a positive outcome and an equal number of randomly selected negative entries. I can then train the model on this set, but I'm stuck on how to modify the model so it works on the original unbalanced population/set.

What are the specific steps to do this? I've pored over the sklearn documentation and examples and haven't found a good explanation.

Ennoble answered 13/2, 2013 at 21:6 Comment(0)

Have you tried passing class_weight="auto" to your classifier? Not all classifiers in sklearn support this, but some do. Check the docstrings.
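A minimal sketch of this approach (note: in recent sklearn versions the 'auto' option has been renamed 'balanced'; the data here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(1000, 4)
y = (rng.rand(1000) < 0.05).astype(int)  # ~5% positives, as in the question

# class_weight='balanced' re-weights samples inversely to class frequency
# during fitting, so no manual resampling is needed
clf = LogisticRegression(class_weight='balanced')
clf.fit(X, y)
```

Because the re-weighting happens inside the objective function, the fitted model can be applied directly to the original unbalanced data with `clf.predict(X)`.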

Also you can rebalance your dataset by randomly dropping negative examples and / or over-sampling positive examples (+ potentially adding some slight gaussian feature noise).
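Both resampling strategies can be sketched in a few lines of numpy (toy data; the noise scale 0.01 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(1000, 3)
y = (rng.rand(1000) < 0.05).astype(int)

pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)

# Under-sample: randomly drop negatives until classes are balanced
neg_down = rng.choice(neg, size=len(pos), replace=False)
idx = np.concatenate([pos, neg_down])
X_bal, y_bal = X[idx], y[idx]

# Over-sample: duplicate positives (with replacement) and add slight
# Gaussian feature noise so the copies are not exact duplicates
pos_up = rng.choice(pos, size=len(neg), replace=True)
X_over = np.concatenate([X[neg], X[pos_up] + 0.01 * rng.randn(len(pos_up), X.shape[1])])
y_over = np.concatenate([y[neg], y[pos_up]])
```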

Incidence answered 13/2, 2013 at 22:34 Comment(2)
Yes, class_weight='auto' works great. Is there any advantage to not using the built-in/black-box auto weight and instead rebalancing the training set (as I originally did)? Regardless, if I took the approach of balancing the training set, how do I adjust the fit/trained model to apply to an unbalanced test set?Ennoble
It's not that black-box: it just re-weights the samples in the empirical objective function being optimized by the algorithm. Under-sampling over-represented classes is good because training is faster :) but you are dropping data, which is bad, especially if your model is already in an overfitting regime (significant gap between train and test scores). Over-sampling is generally mathematically equivalent to re-weighting, but slower because of duplicated operations.Incidence

@agentscully Have you read the [SMOTE paper](https://www.jair.org/media/953/live-953-2037-jair.pdf)? I found it very informative. Here is the link to the Repo. Depending on how you go about balancing your target classes, you can use either:

  • 'auto': deprecated as of version 0.17; use 'balanced' instead, or specify the class weights yourself, e.g. {0: 0.1, 1: 0.9}.
  • 'balanced': this mode adjusts the weights inversely proportional to class frequencies, as n_samples / (n_classes * np.bincount(y)).
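A quick sketch of what the 'balanced' formula computes, on toy labels mirroring the ~5%/95% split from the question:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 95 negatives, 5 positives
y = np.array([0] * 95 + [1] * 5)

# The formula the 'balanced' mode uses
n_samples, n_classes = len(y), len(np.bincount(y))
weights = n_samples / (n_classes * np.bincount(y))

# sklearn's helper computes the same values
sk_weights = compute_class_weight(class_weight='balanced',
                                  classes=np.array([0, 1]), y=y)
```

The minority class ends up with a much larger weight (10.0 here) than the majority class (~0.53), which is what compensates for the imbalance during training.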

Let me know if more insight is needed.

Eye answered 15/6, 2016 at 2:26 Comment(0)
