How to increase the model accuracy of logistic regression in Scikit python?

I am trying to predict the admit variable from predictors such as gre, gpa and rank, but the prediction accuracy is very low (0.66 on the test set). The dataset is given below.

https://gist.github.com/abyalias/3de80ab7fb93dcecc565cee21bd9501a

The first few rows of the dataset look like this:

   admit  gre   gpa  rank_2  rank_3  rank_4
0      0  380  3.61     0.0     1.0     0.0
1      1  660  3.67     0.0     1.0     0.0
2      1  800  4.00     0.0     0.0     0.0
3      1  640  3.19     0.0     0.0     1.0
4      0  520  2.93     0.0     0.0     1.0
5      1  760  3.00     1.0     0.0     0.0
6      1  560  2.98     0.0     0.0     0.0

My code:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

y = data['admit']
x = data[data.columns[1:]]

xtrain, xtest, ytrain, ytest = train_test_split(x, y, random_state=2)

#modelling 
clf = LogisticRegression(penalty='l2')
clf.fit(xtrain, ytrain)
ypred_train = clf.predict(xtrain)
ypred_test = clf.predict(xtest)

#checking the classification accuracy
accuracy_score(ytrain, ypred_train)
# 0.70333333333333337
accuracy_score(ytest, ypred_test)
# 0.66000000000000003

#confusion matrix (rows = true class, columns = predicted class)
confusion_matrix(ytest, ypred_test)
# array([[62,  1],
#        [33,  4]])

Most of the ones (admits) are wrongly predicted as zeros. How do I increase the model accuracy?

Honeydew answered 28/6, 2016 at 13:9 Comment(5)
You could start by tuning the C parameter of logistic regression. You could also try different classification methods like SVMs and trees. - Averi
You should not be optimising the accuracy on your test set. You should optimise on the training set and use the test set as an objective evaluation of the method. Can you edit your question to show the accuracy score based on the training set? - Thanos
Hi, the accuracy based on the training set is added. - Honeydew
@geompalik, I tried C=0.01 and C=100. With C=100, the accuracy on the training set increased to 72.66% and the accuracy on the test set is 68.99%. But there is still no remarkable difference. - Honeydew
Two points: (i) Evaluating a model on the training set, as indicated by ncfirth above, is bad practice in general, since a model fits the training data and such a score would not say anything about its generalizing ability. You should opt for cross-validation. (ii) I agree with the points of Abhinav below. I would suggest normalizing your gre and gpa, because their values dominate your feature vectors. Try for example: scikit-learn.org/stable/modules/generated/… - Averi

Since machine learning is largely about experimenting with features and models, there is no single correct answer to your question. Some suggestions:

1. Feature Scaling and/or Normalization - Check the scales of your gre and gpa features. They differ by two orders of magnitude (gre is in the hundreds, gpa is below 5). Therefore, your gre feature can end up dominating the others in a regularized classifier like Logistic Regression. You can normalize all your features to the same scale before putting them in a machine learning model. This is a good guide on the various feature scaling and normalization classes available in scikit-learn.

2. Class Imbalance - Look for class imbalance in your data. Since you are working with admit/reject data, rejects are likely to outnumber admits (in your confusion matrix, the test set has 63 rejects and 37 admits). Most classifiers in scikit-learn, including LogisticRegression, have a class_weight parameter. Setting it to 'balanced' might work well in case of a class imbalance.

3. Optimize other scores - You can also optimize other metrics, such as Log Loss and F1-Score. The F1-Score can be useful in case of class imbalance, where accuracy alone is misleading. This is a good guide that talks more about scoring.

4. Hyperparameter Tuning - Grid Search - You can improve your accuracy by performing a Grid Search to tune the hyperparameters of your model. For example, in LogisticRegression, the parameter C is a hyperparameter. Avoid using the test data during grid search; perform cross-validation instead, and use the test data only to report the final numbers for your final model. Note that a grid search should be done for every model you try, because only then can you tell what is the best you can get from each model. Scikit-Learn provides the GridSearchCV class for this. This article is also a good starting point.

5. Explore more classifiers - Logistic Regression learns a linear decision surface that separates your classes. It is possible that your two classes are not linearly separable. In that case you might need to look at other classifiers, such as Support Vector Machines, which are able to learn more complex decision boundaries. You can also start looking at tree-based classifiers such as Decision Trees, which can learn rules from your data. Think of them as a series of If-Else rules which the algorithm automatically learns from the data. It is often difficult to get the right Bias-Variance Tradeoff with Decision Trees, so I would recommend looking at Random Forests if you have a considerable amount of data.

6. Error Analysis - For each of your models, go back and look at the cases where they are failing. You might find that some of your models work well on one part of the feature space while others work better on other parts. If this is the case, then ensemble techniques such as VotingClassifier often give the best results. Models that win Kaggle competitions are frequently ensemble models.

7. More Features - If all of this fails, then you should start looking for more features.
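
Points 1 to 4 can be combined in a few lines. The following is only a minimal sketch, not a tuned solution: it uses a synthetic dataset from make_classification as a stand-in for the admissions data (an assumption, so the numbers it prints say nothing about the real file), and the grid of C values is arbitrary. It puts the scaler and the classifier in one pipeline, sets class_weight='balanced', and lets GridSearchCV pick C by cross-validated F1 on the training data only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the admissions data (an assumption, not the real file):
# 400 rows, 5 features, roughly 70% negatives and 30% positives.
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=2)

# Scaling lives inside the pipeline, so it is re-fit on each CV training fold
# and never sees the validation fold or the test set.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Search C with 5-fold cross-validation on the training data, scored by F1.
# The grid values are illustrative only.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

print("Best C:", search.best_params_["logisticregression__C"])
print("Cross-validated F1:", search.best_score_)

# The test set is used once, to report the final number.
test_f1 = f1_score(y_test, search.predict(X_test))
print("Test F1:", test_f1)
```

Changing scoring="f1" to scoring="neg_log_loss" or scoring="accuracy" optimizes a different metric with the same code. To try it on the real data, replace X and y with the predictor columns and the admit column.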

Mcferren answered 28/6, 2016 at 17:59 Comment(2)
Nice answer. Can you please elaborate on "You can optimize on other metrics also such as Log Loss and F1-Score"? How do we do this? I appreciate any help! - Thegn
Regarding 4. Hyperparameter tuning: Bayesian optimization gets people excited these days. It should offer the right balance between model performance and the number of hyperparameter combinations tested. - Corrugate

A relatively easy thing to try is adding polynomial features. You can tune the polynomial degree as a hyperparameter.
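
One way to tune the degree is to treat it as a grid-search parameter. This is a rough sketch under assumptions: the data is synthetic (make_classification, not the admissions file), and the step names "scale", "poly" and "clf" as well as the candidate degrees are my own choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption): 300 rows, 3 informative features.
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Scale first, then expand to polynomial terms, then fit the classifier.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Each candidate degree is evaluated with 5-fold cross-validation.
degree_search = GridSearchCV(pipe, {"poly__degree": [1, 2, 3, 4]},
                             scoring="accuracy", cv=5)
degree_search.fit(X, y)

print("Best degree:", degree_search.best_params_["poly__degree"])
print("Cross-validated accuracy:", degree_search.best_score_)
```

Higher degrees add many more columns, so on a small dataset the cross-validated score is a safer guide than the training accuracy.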

Also, check out the benchmark model results. The confusion matrix of the benchmark model (in the OP) shows that almost no positive predictions are being made on the test data. One reason for this could be that outliers are "confusing" the model. In fact, the box plot of admit vs gre of the training data of the dataset in the OP looks as follows:

[Figure: box plot of gre by admit for the training data]

There are people who weren't admitted even though their GRE score was 800 and there are people who were admitted even though their GRE score was less than 350. We could simply remove these outliers from the training dataset and train LogisticRegression on the trimmed data. It's important to note that you shouldn't remove anything from the test set as it's a held-out set that is assumed to be unseen data. So any outlier handling should be done only on the training set.

As Abhinav Arora mentioned, feature normalization is also something you can try with minimal fuss.

All in all, with very minimal additional code, the test accuracy was improved by 10 percentage points from 0.66 to 0.76.

import pandas as pd
# import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

# read data
data = pd.read_csv("https://gist.githubusercontent.com/abyalias/3de80ab7fb93dcecc565cee21bd9501a/raw/d9d70f7e16082b09850aa545db86897c68ac3e71/gpa_final.csv", sep='\t')

# split into train and test data
xtrain, xtest = train_test_split(data, random_state=2)

# boxplot
# sns.boxplot(x='admit', y='gre', data=xtrain);

# remove outliers from training set
ztrain = xtrain.query("not ((admit==1 and gre < 350) or (admit==0 and gre>=800))")

# split into x and y variables
ytrain = ztrain.pop('admit')
ytest = xtest.pop('admit')

# normalize the data
sc = StandardScaler()
ztrain = sc.fit_transform(ztrain)
ztest = sc.transform(xtest)

# add polynomial features
poly = PolynomialFeatures(degree=4)
ztrain = poly.fit_transform(ztrain)
ztest = poly.transform(ztest)

# model
# penalty=None disables regularization (scikit-learn >= 1.2; older versions use penalty='none')
clf = LogisticRegression(penalty=None, max_iter=1000)
clf.fit(ztrain, ytrain)

# checking accuracy
print("Train accuracy =", clf.score(ztrain, ytrain))   # 0.7665505226480837
print("Test accuracy  =", clf.score(ztest, ytest))     # 0.76

That said, for other datasets, it's very possible that adding polynomial features, normalization, handling outliers etc. simply cannot improve accuracy because the data is too limiting. In that case, you'll need to get more data to come up with more predictive features.

Cheep answered 10/3, 2023 at 22:27 Comment(0)
