How to allow sklearn K Nearest Neighbors to take custom distance metric?

I have a custom distance metric that I need to use for KNN, K Nearest Neighbors.

I tried following this, but I cannot get it to work for some reason.

I would assume that the distance metric is supposed to take two vectors/arrays of the same length, as I have written below:

import sklearn 
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

def d(a,b,L):
    # Inputs: a and b are rows from a data matrix   
    return a+b+2+L

knn=NearestNeighbors(n_neighbors=1,
                 algorithm='auto',
                 metric='pyfunc',
                 func=lambda a,b: d(a,b,L)
                 )


X=pd.DataFrame({'b':[0,3,2],'c':[1.0,4.3,2.2]})
knn.fit(X)

However, when I call: knn.kneighbors(), it doesn't seem to like the custom function. Here is the bottom of the error stack:

ValueError: Unknown metric pyfunc. Valid metrics are ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski'], or 'precomputed', or a callable

However, the question I cited uses exactly the same approach. Any ideas on how to make this work with sklearn version 0.14? I'm not aware of any relevant differences between the versions.

Thanks.

Selenodont answered 22/12, 2015 at 3:31 Comment(1)
Also, your distance function is no good: it will return a vector, whereas it needs to return a single value. - Mervinmerwin

The documentation is actually pretty clear on the use of the metric argument:

metric : string or callable, default ‘minkowski’

metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.

If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.

Thus (as the error message also says), metric should be a callable, not a string. It should accept two arguments (arrays) and return a single value. Your lambda already has that signature, so pass it directly as metric and drop the func argument.

Thus, your code can be simplified to:

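import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

L = 1.0  # placeholder: the question never defines L

def d(a, b, L):
    # a and b are 1-D arrays (rows of X); the metric must return one scalar.
    # The original a+b+2+L returned an array, so this placeholder uses a
    # weighted Manhattan distance instead. Substitute your own metric here.
    return L * np.sum(np.abs(a - b))

knn = NearestNeighbors(n_neighbors=1,
                       algorithm='auto',
                       metric=lambda a, b: d(a, b, L))
X = pd.DataFrame({'b': [0, 3, 2], 'c': [1.0, 4.3, 2.2]})
knn.fit(X)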
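
For a version that runs end to end, including a query, here is a minimal sketch. It assumes a recent scikit-learn release in which NearestNeighbors accepts metric_params, so it may not apply to 0.14. The function name weighted_manhattan, the weight L=2.0, and the query point are illustrative stand-ins, not the asker's real metric. algorithm='brute' is chosen only to keep the example simple.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def weighted_manhattan(a, b, L=1.0):
    # Hypothetical metric for illustration: a and b are 1-D arrays (one row each).
    # A custom metric must return a single number, not an array.
    return L * np.sum(np.abs(a - b))


# Same data as in the question, as a plain NumPy array.
X = np.array([[0, 1.0], [3, 4.3], [2, 2.2]])

knn = NearestNeighbors(
    n_neighbors=1,
    algorithm="brute",
    metric=weighted_manhattan,
    metric_params={"L": 2.0},  # extra keyword arguments forwarded to the metric
)
knn.fit(X)

# Find the nearest stored row to a new point (an arbitrary example query).
distances, indices = knn.kneighbors(np.array([[2.1, 2.0]]))
print(indices[0][0], distances[0][0])
```

Passing the extra parameter through metric_params instead of a lambda keeps the metric a plain named function, which is generally easier to debug and reuse.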
Romeliaromelle answered 22/12, 2015 at 3:50 Comment(2)
Thank you. The documentation I saw was here and here, neither of which is as detailed as what you cited. - Selenodont
I used the following code, and it's giving me a pickling error. Can you help me with this? My code:
def dist2(a,b): return jaccard(a,b)
knnobj = NearestNeighbors(n_neighbors=6, algorithm='auto', metric=lambda a,b: dist2(a,b)).fit(my_Data)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
- Chafin
