How to calculate logistic regression accuracy

I am a complete beginner in machine learning and Python, and I have been tasked with coding logistic regression from scratch to understand what happens under the hood. So far I have written the hypothesis function, the cost function, and gradient descent, and then put them together into the logistic regression routine. However, when I print the accuracy I get a low value (0.69) that doesn't change with more iterations or a different learning rate. My question is: is there a problem with my accuracy code below? Any help pointing me in the right direction would be appreciated.

X = data[['radius_mean', 'texture_mean', 'perimeter_mean',
   'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
   'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
   'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
   'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
   'fractal_dimension_se', 'radius_worst', 'texture_worst',
   'perimeter_worst', 'area_worst', 'smoothness_worst',
   'compactness_worst', 'concavity_worst', 'concave points_worst',
   'symmetry_worst', 'fractal_dimension_worst']]
X = np.array(X)
X = min_max_scaler.fit_transform(X)
Y = data["diagnosis"].map({'M':1,'B':0})
Y = np.array(Y)

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25)

X = data["diagnosis"].map(lambda x: float(x))

def Sigmoid(z):
    if z < 0:
        return 1 - 1/(1 + math.exp(z))
    else:
        return 1/(1 + math.exp(-z))

def Hypothesis(theta, x):
    z = 0
    for i in range(len(theta)):
        z += x[i]*theta[i]
    return Sigmoid(z)

def Cost_Function(X,Y,theta,m):
    sumOfErrors = 0
    for i in range(m):
        xi = X[i]
        hi = Hypothesis(theta,xi)
        error = Y[i] * math.log(hi if  hi >0 else 1)
        if Y[i] == 1:
            error = Y[i] * math.log(hi if  hi >0 else 1)
        elif Y[i] == 0:
            error = (1-Y[i]) * math.log(1-hi  if  1-hi >0 else 1)
        sumOfErrors += error

    constant = -1/m
    J = constant * sumOfErrors
    #print ('cost is: ', J ) 
    return J

def Cost_Function_Derivative(X,Y,theta,j,m,alpha):
    sumErrors = 0
    for i in range(m):
        xi = X[i]
        xij = xi[j]
        hi = Hypothesis(theta,X[i])
        error = (hi - Y[i])*xij
        sumErrors += error
    m = len(Y)
    constant = float(alpha)/float(m)
    J = constant * sumErrors
    return J

def Gradient_Descent(X,Y,theta,m,alpha):
    new_theta = []
    constant = alpha/m
    for j in range(len(theta)):
        CFDerivative = Cost_Function_Derivative(X,Y,theta,j,m,alpha)
        new_theta_value = theta[j] - CFDerivative
        new_theta.append(new_theta_value)
    return new_theta


def Accuracy(theta):
    correct = 0
    length = len(X_test, Hypothesis(X,theta))
    for i in range(length):
        prediction = round(Hypothesis(X[i],theta))
        answer = Y[i]
    if prediction == answer.all():
            correct += 1
    my_accuracy = (correct / length)*100
    print ('LR Accuracy %: ', my_accuracy)



def Logistic_Regression(X,Y,alpha,theta,num_iters):
    theta = np.zeros(X.shape[1])
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X,Y,theta,m,alpha)
        theta = new_theta
        if x % 100 == 0:
            Cost_Function(X,Y,theta,m)
            print ('theta: ', theta)    
            print ('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)

initial_theta = [0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]  
alpha = 0.0001
iterations = 1000
Logistic_Regression(X,Y,alpha,initial_theta,iterations)

This uses data from the Wisconsin breast cancer dataset (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data), where I am using all 30 features - although restricting to features that are known to correlate with the diagnosis also doesn't change my accuracy.

Ranger asked 22/11, 2017 at 15:04 Comment(4)
Consider using sklearn's accuracy_score to check if it produces the same accuracy rate: scikit-learn.org/stable/modules/generated/… - Cohlette
What is all in answer.all()? Why not simply if prediction == answer inside the for loop? - Disinfest
I would think the cost function and the gradient descent function would be likely candidates for errors, but you haven't shown them. Are you certain they are correct? Also, there is some strange stuff in this code: why are you calling Cost_Function(X,Y,theta,m) and not saving the result? You're passing two arguments to len(), etc. - Anderer
I've updated to include almost all of my code, and I will look into how I am calling Cost_Function and len() - thank you for the help - Ranger

I'm not sure how you arrived at a value of 0.0001 for alpha, but I think it's too low. Using your code with the cancer data shows that the cost is decreasing with each iteration -- it's just doing so glacially.

When I raise it to 0.5, I still get a decreasing cost, but at a more reasonable rate. After 1000 iterations it reports:

cost:  0.23668000993020666

And after fixing the Accuracy function I'm getting 92% on the test segment of the data.
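
For reference, the fix keeps your loop but scores the held-out test set, passes a single argument to len(), and moves the comparison inside the loop. A minimal corrected sketch (assuming the X_test/Y_test arrays from your train_test_split, and a theta trained on the same scaled features):

def Accuracy(theta):
    correct = 0
    length = len(X_test)                              # one argument: the test-set size
    for i in range(length):
        prediction = round(Hypothesis(theta, X_test[i]))
        if prediction == Y_test[i]:                   # compare inside the loop, per sample
            correct += 1
    my_accuracy = (correct / length) * 100
    print('LR Accuracy %: ', my_accuracy)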

You already have NumPy installed, as shown by X = np.array(X), so you should really consider using it for your operations. It will be orders of magnitude faster for jobs like this. Here is a vectorized version that gives results instantly rather than after a long wait:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv("cancerdata.csv")
X = df.values[:, 2:-1].astype('float64')
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
X = MinMaxScaler().fit_transform(X)

## Add the bias column after scaling (min-max scaling a constant column would turn it into zeros)
X = np.hstack([np.ones((X.shape[0], 1)), X])
Y = df["diagnosis"].map({'M': 1, 'B': 0})
Y = np.array(Y)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)


def Sigmoid(z):
    return 1/(1 + np.exp(-z))

def Hypothesis(theta, x):   
    return Sigmoid(x @ theta) 

def Cost_Function(X,Y,theta,m):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = 1/float(m) * np.sum(-_y * np.log(hi) - (1-_y) * np.log(1-hi))
    return J

def Cost_Function_Derivative(X,Y,theta,m,alpha):
    hi = Hypothesis(theta,X)
    _y = Y.reshape(-1, 1)
    J = alpha/float(m) * X.T @ (hi - _y)
    return J

def Gradient_Descent(X,Y,theta,m,alpha):
    new_theta = theta - Cost_Function_Derivative(X,Y,theta,m,alpha)
    return new_theta

def Accuracy(theta):
    length = len(X_test)
    prediction = (Hypothesis(theta, X_test) > 0.5)   # threshold the sigmoid output at 0.5
    _y = Y_test.reshape(-1, 1)
    correct = prediction == _y
    my_accuracy = (np.sum(correct) / length) * 100
    print('LR Accuracy %: ', my_accuracy)

def Logistic_Regression(X,Y,alpha,theta,num_iters):
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X,Y,theta,m,alpha)
        theta = new_theta
        if x % 100 == 0:
            #print ('theta: ', theta)    
            print ('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)

ep = .012

initial_theta = np.random.rand(X_train.shape[1],1) * 2 * ep - ep
alpha = 0.5
iterations = 2000
Logistic_Regression(X_train,Y_train,alpha,initial_theta,iterations)

I think I might have a different version of scikit-learn, because I had to change the MinMaxScaler line to make it work. The result is that I can run 10K iterations in the blink of an eye, and applying the model to the test set gives about 97% accuracy.

Anderer answered 23/11, 2017 at 0:45 Comment(5)
Thank you for this response; now I can see what I need to learn about in more detail, and how your code improves the speed. Do you know which scikit-learn version you have? I have tried to run the code you've given here (I use scikit-learn from Anaconda v3.6.3), but I get the following error: \Anaconda3\lib\site-packages\ipykernel_launcher.py:7: RuntimeWarning: invalid value encountered in greater import sys - Ranger
Might it also be an issue that I have this at the start of my code when I load my file (I based my feature range on the largest value in the dataset): min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0,5000)) data = pd.read_csv("data.csv",header=0) - Ranger
Changed this initial part of my code and now it works like yours - so now I understand, thank you! - Ranger
Hi, I don't know if you'll have a chance to see this reply, but why do you have ep = .012 in this code? It's the last part of this that I don't understand. - Ranger
It's not important. I like to set the initial theta to random non-zero numbers, and that number was just in my head. Using it with rand() like this should give numbers between +/- .012. It was a bad choice for an example because it seems very specific, but it's not. - Anderer

Python gives us the scikit-learn library, which makes this kind of work easier. This is what worked for me:

from sklearn.metrics import accuracy_score

y_pred = log.predict(x_test)
score = accuracy_score(y_test, y_pred)
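
For completeness, log here is assumed to be an already-fitted classifier. A minimal end-to-end sketch under that assumption (the split variables and the feature matrix X and labels Y are stand-ins for your own data, not part of the original answer):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)

log = LogisticRegression(max_iter=1000)   # the fitted model the snippet above calls 'log'
log.fit(x_train, y_train)

y_pred = log.predict(x_test)
print(accuracy_score(y_test, y_pred))     # fraction of test samples classified correctly
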
Frost answered 4/7, 2019 at 18:16 Comment(0)

Accuracy is one of the most intuitive performance measures: it is simply the ratio of correctly predicted observations to the total number of observations. Higher accuracy means the model is performing better.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

TP = True positives
TN = True negatives
FP = False positives
FN = False negatives

When you use accuracy as the measure, your false positives and false negatives should have similar cost. Otherwise, a better metric is the F1-score, which is given by

F1-score = 2 * (Precision * Recall) / (Precision + Recall), where

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
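
As a quick illustration, these formulas can be computed directly from binary label arrays; the toy arrays below are made up for the example, and sklearn.metrics (precision_score, recall_score, f1_score) would give the same numbers:

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # made-up ground truth
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # made-up predictions

tp = np.sum((y_pred == 1) & (y_true == 1))    # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))    # true negatives
fp = np.sum((y_pred == 1) & (y_true == 0))    # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))    # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)        # 0.75 0.75 0.75 0.75 for these arrays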

Read more here

https://en.wikipedia.org/wiki/Precision_and_recall

The beauty of machine learning in Python is that important modules like scikit-learn are open source, so you can always look at the actual code. The link below points to the scikit-learn metrics source code, which will give you an idea of how scikit-learn calculates the accuracy score when you do

from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)

https://github.com/scikit-learn/scikit-learn/tree/master/sklearn/metrics
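
For plain 1-D label arrays, the heart of that computation reduces to the mean of elementwise matches. This is a simplified sketch of the idea, not scikit-learn's actual implementation (which also handles sample weights, normalization, and multilabel input):

import numpy as np

def simple_accuracy(y_true, y_pred):
    # fraction of positions where the prediction matches the ground truth
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))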

Facelift answered 23/11, 2017 at 5:42 Comment(1)
Thank you for these resources, I will have a look and also try implementing these different metrics too - Ranger

This also works, using vectorization to calculate the accuracy. But as the answer above noted, accuracy is not the recommended metric here: if the data is not well balanced, you should not use accuracy; use the F1-score instead.

import numpy as np
import sklearn.linear_model

clf = sklearn.linear_model.LogisticRegressionCV()
clf.fit(X.T, Y.T)
LR_predictions = clf.predict(X.T)
accuracy = float((np.dot(Y, LR_predictions) + np.dot(1 - Y, 1 - LR_predictions)) / float(Y.size) * 100)
print('Accuracy of logistic regression: %d %% (percentage of correctly labelled datapoints)' % accuracy)
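
For 0/1 label vectors, np.dot(Y, LR_predictions) counts the true positives and np.dot(1-Y, 1-LR_predictions) counts the true negatives, so the expression above is equivalent to this shorter one-liner (assuming the same Y and LR_predictions):

accuracy = np.mean(LR_predictions == Y) * 100   # same percentage of correctly labelled points
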
Doner answered 5/8, 2021 at 15:58 Comment(0)
