Scikit: calculate precision and recall using cross_val_score function

I'm using scikit-learn to perform logistic regression on spam/ham data. X_train is my training data and y_train the labels ('spam' or 'ham'), and I trained my LogisticRegression this way:

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

If I want to get the accuracies for a 10-fold cross-validation, I just write:

from sklearn.model_selection import cross_val_score

accuracy = cross_val_score(classifier, X_train, y_train, cv=10)

I thought it was also possible to calculate the precision and recall by simply adding one parameter, like this:

precision = cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision')
recall = cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall')

But it results in a ValueError:

ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], dtype='|S4') 

Is it related to the data (should I binarize the labels?), or did they change the cross_val_score function?

Thank you in advance!

Hauser answered 8/12, 2014 at 11:34 Comment(0)

To compute the recall and precision, the labels indeed have to be binarized, like this:

from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit(y_train)
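
The fitted binarizer can then transform the labels into 0/1 before they are passed to cross_val_score. A minimal sketch, assuming the classifier and X_train from the question (lb.classes_ shows which original label is encoded as 1, i.e. treated as the positive class):

from sklearn.model_selection import cross_val_score

print(lb.classes_)  # ['ham' 'spam'] -> 'spam' is encoded as 1 and treated as positive
y_train_bin = lb.transform(y_train).ravel()  # flatten the (n, 1) output to a 1-D array

precision = cross_val_score(classifier, X_train, y_train_bin, cv=10, scoring='precision')
recall = cross_val_score(classifier, X_train, y_train_bin, cv=10, scoring='recall')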

Going further, I was surprised that I didn't have to binarize the labels when I wanted to calculate the accuracy:

accuracy = cross_val_score(classifier, X_train, y_train, cv=10)

That's because the accuracy formula doesn't need to know which class is considered positive or negative: (TP + TN) / (TP + TN + FP + FN). TP and TN are exchangeable there, which is not the case for recall, precision and F1: precision = TP / (TP + FP) and recall = TP / (TP + FN) both depend on which class is treated as positive.
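
A small illustration of that asymmetry, on made-up toy labels rather than the question's data: accuracy_score gives the same number no matter which label is called positive, while precision_score changes.

from sklearn.metrics import accuracy_score, precision_score

y_true = ['spam', 'spam', 'ham', 'ham', 'ham']
y_pred = ['spam', 'ham', 'ham', 'ham', 'spam']

print(accuracy_score(y_true, y_pred))                     # 0.6 -- no positive class needed
print(precision_score(y_true, y_pred, pos_label='spam'))  # 0.5 (1 TP out of 2 predicted 'spam')
print(precision_score(y_true, y_pred, pos_label='ham'))   # 2/3 (2 TP out of 3 predicted 'ham')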

Hauser answered 9/12, 2014 at 13:1 Comment(0)

I encountered the same problem, and I solved it like this:

# precision, recall and F1
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import cross_val_score

lb = LabelBinarizer()
# binarize the string labels and flatten the (n, 1) output to a 1-D array
y_train = np.array([number[0] for number in lb.fit_transform(y_train)])

recall = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Recall', np.mean(recall), recall)
precision = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Precision', np.mean(precision), precision)
f1 = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('F1', np.mean(f1), f1)

Childs answered 4/5, 2015 at 8:46 Comment(0)

The syntax you showed above is correct. It looks like a problem with the data you're using. The labels don't need to be binarized, as long as they're not continuous numbers.

You can prove out the same syntax with a different dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
X_train = iris['data']
y_train = iris['target']

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# iris is a multiclass problem, so current scikit-learn needs an averaged scorer
# ('precision_macro'/'recall_macro') instead of the plain binary 'precision'/'recall'
print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision_macro'))
print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall_macro'))
Brickwork answered 8/12, 2014 at 18:29 Comment(0)

You could use cross-validation like this to get the F1-score and recall:

from time import time
from sklearn.model_selection import cross_val_score

print('10-fold cross validation:\n')
start_time = time()
f1_scores = cross_val_score(clf, X, y, cv=10, scoring='f1')
recall_scores = cross_val_score(clf, X, y, cv=10, scoring='recall')
print("f1: %0.2f (+/- %0.2f) [%s]" % (f1_scores.mean(), f1_scores.std(), 'DecisionTreeClassifier'))
print("recall: %0.2f (+/- %0.2f) [%s]" % (recall_scores.mean(), recall_scores.std(), 'DecisionTreeClassifier'))
print("--- Classifier %s took %s seconds ---" % ('DecisionTreeClassifier', time() - start_time))

For more scoring parameters, see the scikit-learn model evaluation documentation.
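
As a side note, if you want to list the valid scoring strings programmatically, recent scikit-learn versions (1.0 and later) provide sklearn.metrics.get_scorer_names:

from sklearn.metrics import get_scorer_names

# every string accepted by the scoring parameter, e.g. 'precision', 'recall', 'f1_macro', ...
print(get_scorer_names())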

Ruthenian answered 13/5, 2016 at 8:13 Comment(0)

You should specify which of the two labels is the positive one (it could be 'ham'):

from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import cross_val_score

precision = make_scorer(precision_score, pos_label="ham")

precision_scores = cross_val_score(classifier, X_train, y_train, cv=10, scoring=precision)
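
The same pattern works for the other binary metrics. A minimal sketch, assuming the classifier and training data from the question and keeping 'ham' as the positive label:

from sklearn.metrics import make_scorer, recall_score, f1_score
from sklearn.model_selection import cross_val_score

recall = make_scorer(recall_score, pos_label="ham")
f1 = make_scorer(f1_score, pos_label="ham")

recall_scores = cross_val_score(classifier, X_train, y_train, cv=10, scoring=recall)
f1_scores = cross_val_score(classifier, X_train, y_train, cv=10, scoring=f1)
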
Emlin answered 5/7, 2021 at 7:52 Comment(0)
