Why doesn't scikit-learn's NearestNeighbors seem to return proper cosine similarity distances?
I am trying to use scikit-learn's NearestNeighbors implementation to find the column vectors closest to a given column vector in a matrix of random values.

This code is supposed to find the nearest neighbors of column 21, then check the actual cosine similarity of those neighbors against column 21.

from sklearn.neighbors import NearestNeighbors
import sklearn.metrics.pairwise as smp
import numpy as np

test = np.random.randint(0, 5, (50, 50))
nbrs = NearestNeighbors(n_neighbors=5, algorithm='auto', metric=smp.cosine_similarity).fit(test)
distances, indices = nbrs.kneighbors(test)

x = 21

for idx, d in enumerate(indices[x]):
    sim2 = smp.cosine_similarity(test[:, x], test[:, d])
    print "sklearn's cosine similarity would be ", sim2
    print "sklearn's reported distance is", distances[x][idx]
    print "if that distance were cosine, the similarity would be: ", 1 - distances[x][idx]

Output looks like

sklearn's cosine similarity would be  [[ 0.66190748]]
sklearn's reported distance is 0.616586738214
if that distance were cosine, the similarity would be:  0.383413261786

So the output of kneighbors is neither the cosine distance nor the cosine similarity. What gives?

Also, as an aside, I thought sklearn's NearestNeighbors implementation was not an approximate nearest neighbors approach, yet it doesn't seem to find the actual best neighbors in my dataset, compared to the results I get when I iterate over the matrix and check the similarities of column 21 against all the others. Am I misunderstanding something basic here?

Levity asked 12/4, 2014 at 15:50 Comment(0)

OK, the problem was that NearestNeighbors' .fit() method by default assumes the rows are samples and the columns are features. I had to transpose the matrix before passing it to fit().

EDIT: Another problem is that the callable passed as metric should be a distance callable, not a similarity callable. Otherwise you'll get the K farthest neighbors. :/
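For reference, here is a minimal sketch of the corrected version. It uses the built-in metric='cosine' (a distance, i.e. 1 - similarity) together with algorithm='brute', as one of the comments below also suggests, and transposes the matrix so each column becomes a sample:

from sklearn.neighbors import NearestNeighbors
import sklearn.metrics.pairwise as smp
import numpy as np

test = np.random.randint(0, 5, (50, 50))

# fit() treats rows as samples, so transpose to make each column a row.
# metric='cosine' is a distance (1 - similarity), so small values mean near.
nbrs = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine').fit(test.T)
distances, indices = nbrs.kneighbors(test.T)

x = 21
for idx, d in enumerate(indices[x]):
    # cosine_similarity expects 2D inputs, hence the reshape
    sim2 = smp.cosine_similarity(test[:, x].reshape(1, -1), test[:, d].reshape(1, -1))
    print("similarity: %.6f | 1 - distance: %.6f" % (sim2[0][0], 1 - distances[x][idx]))

With this, 1 - distance matches the cosine similarity, and the nearest neighbor of column 21 is column 21 itself at distance 0.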

Levity answered 12/4, 2014 at 16:15 Comment(6)
2 - 2 * cosine similarity is the squared L2 distance of the normalized vectors (see the quick check after these comments). – Idioplasm
Could you change your example to make it smaller, e.g. (20, 40) instead of (500, 500)? It took a while to run on my computer, and it doesn't need to be that big to prove the point. Making the shape non-square helps disambiguate the sample and feature axes. If, all other things being equal, you write sim2 = smp.cosine_similarity(test[x, :], test[d, :]) in your loop, then all the values end up coinciding. – Idioplasm
I changed the row/column counts; it should run faster now. – Levity
OK, so is there anything left to be said? You seem to have found the answer yourself, right? Concerning the "aside" you mention: if you specify algorithm="brute", the algorithm will calculate all distances. Otherwise it may resort to smart heuristics (such as KD-trees). – Idioplasm
No, I don't think so. I wasn't sure about the etiquette of accepting your own answer or deleting the question; it just happened that I found the problem soon after asking. – Levity
That's how I compute KNN with the cosine metric: NearestNeighbors(n_neighbors=5, algorithm="brute", metric="cosine"). Note the algorithm="brute" parameter; that's the only way to use the cosine metric. – Jackqueline
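As a quick check of the identity in the first comment, here is a small standalone sketch (plain NumPy, nothing sklearn-specific):

import numpy as np

# For unit vectors u and v: ||u - v||^2 = 2 - 2 * cos(u, v)
u = np.random.rand(10)
v = np.random.rand(10)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

cos_sim = np.dot(u, v)
sq_l2 = np.sum((u - v) ** 2)
print(np.isclose(sq_l2, 2 - 2 * cos_sim))  # True

Since 2 - 2 * cos decreases exactly when the similarity increases, ranking neighbors by this squared L2 distance on normalized vectors gives the same ordering as cosine distance.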
