Probability prediction method of KNeighborsClassifier returns only 0 and 1

Can anyone tell me what's wrong with my code? I can get graded class probabilities for the iris dataset with LogisticRegression, but KNeighborsClassifier gives me only 0 or 1. Shouldn't it give a result like the one LogisticRegression yields?

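from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics

iris = load_iris()
X = iris.data
y = iris.target

# skf was not defined in the original post; 10 stratified folds is an
# assumption (it matches the 15 test rows shown in the output below).
skf = StratifiedKFold(n_splits=10)

# The loop overwrites the split each time, so only the last fold is kept.
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]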

from sklearn.linear_model import LogisticRegression
ln = LogisticRegression()
ln.fit(X_train,y_train)

ln.predict_proba(X_test)[:,1]

array([ 0.18075722, 0.08906078, 0.14693156, 0.10467766, 0.14823032, 0.70361962, 0.65733216, 0.77864636, 0.67203114, 0.68655163, 0.25219798, 0.3863194 , 0.30735105, 0.13963637, 0.28017798])

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', metric='euclidean')
knn.fit(X_train, y_train)

knn.predict_proba(X_test)[0:10,1]

array([ 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.])

Cassandracassandre answered 7/5, 2016 at 13:30 Comment(1)
Regression != Classification. Not all classifiers support the concept of probability! - Corundum

Because KNN has a very limited concept of probability. Its estimate is simply the fraction of votes for each class among the nearest neighbours. Increase the number of neighbours to 15 or 100, or query a point near the decision boundary, and you will see more diverse results. Currently each of your test points has 5 neighbours that all share the same label (hence a probability of 0 or 1).
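
To illustrate the point about vote fractions, here is a minimal sketch (not the asker's code). It assumes a plain stratified train/test split with a fixed random_state, and the helper name vote_fractions is made up for this example. It recomputes the probabilities by hand from the labels of the k nearest neighbours and prints the distinct values that predict_proba returns for several values of k.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Assumed split for illustration; the question used cross-validation folds.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def vote_fractions(knn, X_query, y_fit, n_classes=3):
    """Hypothetical helper: share of each class among the k nearest neighbours."""
    _, idx = knn.kneighbors(X_query)   # indices into the training set
    votes = y_fit[idx]                 # shape (n_query, k)
    return np.stack(
        [(votes == c).mean(axis=1) for c in range(n_classes)], axis=1
    )


results = {}
for k in (1, 5, 15, 50):
    knn = KNeighborsClassifier(n_neighbors=k, algorithm="ball_tree")
    knn.fit(X_train, y_train)
    proba = knn.predict_proba(X_test)
    results[k] = (proba, vote_fractions(knn, X_test, y_train))
    # Distinct probability values the model can produce for this k
    print(k, np.unique(proba))
```

With uniform weights every value is a multiple of 1/k, so k=5 can only produce 0, 0.2, 0.4, 0.6, 0.8 or 1, and points far from a class boundary land on 0 or 1.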

Norther answered 7/5, 2016 at 13:35 Comment(3)
But then my accuracy decreases, because I move away from the optimal K. How come Weka, with the same K, gives a smoother ROC curve, while here (scikit-learn) the ROC curve is very sharp? - Cassandracassandre
KNN is a heuristic with a lot of parameters, so it is very probable that your results will differ. You have to look up the default values of the metrics and algorithms used. Maybe even the ROC curve evaluation is done differently! There is also randomness involved (in KNN)! - Corundum
The probability output would be more fine-grained with distance weighting (weights='distance' in scikit-learn). - Concealment
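
As a rough sketch of the distance-weighting suggestion (again assuming a simple stratified split rather than the asker's folds), the snippet below fits the same 5-neighbour model with weights='uniform' and weights='distance' and compares the outputs. One caveat worth checking for yourself: when all 5 neighbours share a label, the weighted estimate is still exactly 0 or 1, so the extra resolution only shows up for points whose neighbours disagree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Assumed split for illustration only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

uniform = KNeighborsClassifier(
    n_neighbors=5, algorithm="ball_tree", weights="uniform"
).fit(X_train, y_train)
weighted = KNeighborsClassifier(
    n_neighbors=5, algorithm="ball_tree", weights="distance"
).fit(X_train, y_train)

p_uniform = uniform.predict_proba(X_test)
p_weighted = weighted.predict_proba(X_test)

# Test points whose 5 neighbours all carry the same label
unanimous = np.isclose(p_uniform.max(axis=1), 1.0)

print("unanimous test points:", int(unanimous.sum()), "of", len(X_test))
print("distinct uniform values: ", np.unique(p_uniform))
print("distinct weighted values:", np.unique(p_weighted.round(3)))
```

Rows flagged as unanimous come out identical under both settings; only the remaining rows can take values between the multiples of 0.2.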
