Below is a dummy pandas.DataFrame to use as an example:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
df = pd.DataFrame({'X1':[100,120,140,200,230,400,500,540,600,625],
'X2':[14,15,22,24,23,31,33,35,40,40],
'Y':[0,0,0,0,1,1,1,1,1,1]})
Here we have three columns: X1, X2 and Y.
Suppose X1 and X2 are your independent variables and the 'Y' column is your dependent variable.
X = df[['X1','X2']]
y = df['Y']
With sklearn.model_selection.train_test_split
you split the data into 4 portions, which will be used for fitting the model and for evaluating its predictions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4,random_state=42)
X_train, X_test, y_train, y_test
Now:
1). X_train - This holds the independent variables for the rows that will be used to train the model. Since we specified test_size = 0.4, 60% of the observations in your complete data are used to train/fit the model and the remaining 40% are used to test it.
2). X_test - This is the remaining 40% of the independent variables. These rows are not used in the training phase; the model makes predictions on them so that its accuracy can be tested.
3). y_train - This is the dependent variable the model learns to predict: the category labels that match the rows of X_train. We need to pass it when training/fitting the model.
4). y_test - These are the category labels for your test data. They are compared with the predicted categories to measure the accuracy of the model.
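If you want to check the four portions for yourself, a small sketch like the one below may help. It rebuilds the same dummy DataFrame and only inspects shapes and row indices; the specific checks are my own illustration, not something train_test_split requires you to do.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same dummy data as above
df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
X = df[['X1', 'X2']]
y = df['Y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# 10 rows with test_size=0.4 give 6 training rows and 4 test rows
print(X_train.shape, X_test.shape)  # (6, 2) (4, 2)
print(y_train.shape, y_test.shape)  # (6,) (4,)

# Features and labels stay aligned: each pair shares the same row index
print(X_train.index.equals(y_train.index))  # True
print(X_test.index.equals(y_test.index))    # True

# No row ends up in both the training set and the test set
print(set(X_train.index).isdisjoint(X_test.index))  # True
```

With only 10 rows the test set is just 4 observations, so any accuracy measured on it will move in steps of 0.25.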
Now you can fit a model on this data. Let's fit sklearn.linear_model.LogisticRegression:
logreg = LogisticRegression()
logreg.fit(X_train, y_train) #This is where the training is taking place
y_pred_logreg = logreg.predict(X_test) #Making predictions to test the model on test data
print('Logistic Regression Train accuracy %s' % logreg.score(X_train, y_train)) #Train accuracy
#Logistic Regression Train accuracy 0.8333333333333334
print('Logistic Regression Test accuracy %s' % accuracy_score(y_test, y_pred_logreg)) #Test accuracy (true labels first, then predictions)
#Logistic Regression Test accuracy 0.5
print(confusion_matrix(y_test, y_pred_logreg)) #Confusion matrix
print(classification_report(y_test, y_pred_logreg)) #Classification Report
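To see how the accuracy and the confusion matrix relate, here is a minimal sketch. The y_true and y_pred lists are hypothetical labels I made up for illustration, not the output of the model above. Accuracy is simply the share of predictions that land on the diagonal of the confusion matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, chosen only to illustrate the relationship
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 1]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1]
#  [1 1]]

# Correct predictions sit on the diagonal
manual_accuracy = np.trace(cm) / cm.sum()
print(manual_accuracy)                 # 0.5
print(accuracy_score(y_true, y_pred))  # 0.5
```

Both sklearn.metrics functions take the true labels first and the predictions second. Accuracy comes out the same either way, but the confusion matrix is transposed if you swap them.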
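You can read more about metrics in the scikit-learn documentation for sklearn.metrics.
Read more about splitting data in the scikit-learn documentation for sklearn.model_selection.train_test_split.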
Hope this helps:)