Why GridSearchCV model results are different than the model I manually tuned?

this is my first question ever here I hope I am doing this right,

I was working on titanic dataset which is popular on kaggle, this tutarial if u wanna check A Data Science Framework: To Achieve 99% Accuracy

the part 5.2, it teaches how to gridsearch and tune hyper-parameters. let me share related codes with you before I get spesific on my question;

this is tuning the model with GridSearchCV:

cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0)
#cv_split = model_selection.KFold(n_splits=10, shuffle=False, random_state=None)

param_grid = {'criterion': ['gini', 'entropy'],
'splitter': ['best', 'random'], #splitting methodology; two supported strategies - default is best
'max_depth': [2,4,6,8,10,None], #max depth tree can grow; default is none
'min_samples_split': [2,5,10,.03,.05], #minimum subset size BEFORE new split (fraction is % of total); default is 2
'min_samples_leaf': [1,5,10,.03,.05], #minimum subset size AFTER new split split (fraction is % of total); default is 1
'max_features': [None, 'auto'], #max features to consider when performing split; default none or all
'random_state': [0] }

tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', return_train_score = True ,cv = cv_split)
tune_model.fit(data1[data1_x_bin], data1[Target])`

    tune_model.best_params_

result is: 

    {'criterion': 'gini',
     'max_depth': 4,
     'max_features': None,
     'min_samples_leaf': 5,
     'min_samples_split': 2,
     'random_state': 0,
     'splitter': 'best'}

and acording to code, training and test accuracy sopposed to be like that when tuned with those:

print(tune_model.cv_results_['mean_train_score'][tune_model.best_index_], tune_model.cv_results_['mean_test_score'][tune_model.best_index_])

output of this: 0.8924916598172832 0.8767742588186237

out of curiousity, I wanted to make my own DecisionTreeClassifier() with parameters I got from GridSearchCV,

dtree = tree.DecisionTreeClassifier(criterion = 'gini',max_depth = 4,max_features= None, min_samples_leaf= 5, min_samples_split= 2,random_state = 0,  splitter ='best')

results = model_selection.cross_validate(dtree, data1[data1_x_bin],  data1[Target],return_train_score = True, cv  = cv_split)

Same hyperparameters, same cross validation dataframes, different results. Why?

print(results['train_score'].mean(), results['test_score'].mean())

0.8387640449438202 0.8227611940298509

that one was tune_model results:

0.8924916598172832 0.8767742588186237

difference is not even small. Both results should be same if u ask me,

I don'T understand what is different? what is different so results are different?

I tried cross validating with k-fold instead of shufflesplit,

in both scenarios I tried with different random_state values, tried also random_state = None,

still different results.

can someone explain the difference please?

edit: btw, I also wanted to check test sample results:

dtree.fit(data1[data1_x_bin],data1[Target])
dtree.score(test1_x_bin,test1_y), tune_model.score(test1_x_bin,test1_y)

output: (0.8295964125560538, 0.9033059266872216)

same models(decisiontreeclassifier), same hyper-parameters, very different results

( obviously they are not same models but I can't see how and why )

Update

By default cross_validate uses the estimators score method as default to evaluate its performance (you can change that by specifiying the scoring kw argument of cross validate). The score method of the DecisionTreeClassifier class uses accuracy as its score metric. Within the GridSearchCV roc_auc is used as the score metric. Using the same score metric in both cases results in identical scores. E.g. if the score metric of cross_validate ist changed to roc_aucthe score difference you observed between models vanishes.

results = model_selection.cross_validate(dtree, data1[data1_x_bin],  data1[Target], scoring = 'roc_auc' ... )

Regarding score metrics:

The choice of the score metric determines how the performance of a model is evaluated.

Imagine a model should predict whether a traffic light is green (traffic light is green -> 1, traffic light is not green -> 0). This model can make two types of mistakes. Either it says the traffic light is green although it is not green (false positive) or it says the traffic light is not green although it is green (false negative). In this case, a false negative would be ugly, but bearable in its consequences (somebody has to wait longer at the traffic light than necessary). False positives, on the other hand, would be catastrophic (someone passes the traffic light red because it has been classified as green). In order to evaluate the model's performance, a score metric would be chosen which weighs false positives higher (i.e. classifies them as "worse" errors) than false negatives. Accuracy would be an unsuitable metric here, because false negatives and false positives would lower the score to the same extent. More suitable as a score metric would be, for example, precision. This metric weighs false positives with 1 and false negatives with 0 (the number of false negatives has no influence on the precision of a model). For a good overview what false negatives, false positives, precision, recall, accuracy etc. are see here. The beta parameter of the F score (another score metric) can be used to set how false positives should be weighted compared to false negatives (for a more detailed explanation, see here). More information about the roc_auc score can be found here (it is calculated from different statistics of the confusion matrix).

In summary, this means that the same model can perform very well in relation to one score metric, while it performs poorly in relation to another. In the case you described, the decision tree optimized by GridSearchCV and the tree you instantiated afterwards are identical models. Both yield identical accuracys or identical roc_auc scores. Which score metric you use to compare the performance of different models on your data set depends on which criteria you consider to be particularly important for model performance. If the only criterium is how many instances have been classified correctly, accuracy is a probably a good choice.

Old Idea (see comments):

You specified a random state for dtree (dtree = tree.DecisionTreeClassifier(random_state = 0 ...) , but none for the decision tree used in the GridSearchCV. Use the same random state there and let me know if that solved the problem.

tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(random_state=0), ...)

Recommended topics

Hot tags