DBSCAN error with cosine metric in Python

I was trying to use the DBSCAN algorithm from the scikit-learn library with the cosine metric, but got stuck with an error. The line of code is

db = DBSCAN(eps=1, min_samples=2, metric='cosine').fit(X)    

where X is a csr_matrix. The error is the following:

Metric 'cosine' not valid for algorithm 'auto',

though the documentation says that this metric is supported. I tried the options algorithm='kd_tree' and algorithm='ball_tree' but got the same error. However, there is no error if I use the euclidean or, say, the l1 metric.

The matrix X is large, so I can't use a precomputed matrix of pairwise distances.

I use Python 2.7.6 and scikit-learn 0.16.1. My dataset doesn't contain a row of all zeros, so the cosine metric is well-defined.
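
For reference, a minimal sketch that should reproduce the error under these versions (the random sparse matrix is just a stand-in for my data):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# random sparse matrix standing in for the real data
X = csr_matrix(np.random.rand(5, 3))
db = DBSCAN(eps=1, min_samples=2, metric='cosine').fit(X)
# raises: ValueError: Metric 'cosine' not valid for algorithm 'auto'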

Albertson asked 23/9, 2015 at 17:10 Comment(2)
This is arguably a bug in sklearn, frankly. Cosine similarity isn't a metric. It doesn't obey the triangle inequality, which is why it won't work with a KDTree and you have no choice but to brute-force it. All of which raises the question of why, when you set algorithm to 'auto', it attempts to use a method it should know it can't use. – Preform
@AdamAcosta: If I understand correctly, you're arguing that the 'auto' algorithm keyword should use 'brute' rather than try and fail using 'ball_tree'? (I'd agree.) – Subclavius

The tree indexes in sklearn (probably; this may change with new versions) cannot accelerate the cosine metric.

Try algorithm='brute'.

For a list of metrics that your version of sklearn can accelerate, see the supported metrics of the ball tree:

from sklearn.neighbors import BallTree
print(BallTree.valid_metrics)   # metrics the tree index can accelerate
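
Putting it together, a minimal sketch of the suggested fix (the toy data and the eps value here are hypothetical, not from the question):

import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical toy data; note the double precision mentioned in the comments
X = np.random.rand(10, 4).astype(np.double)

# 'brute' computes all pairwise cosine distances directly, bypassing the tree indexes
db = DBSCAN(eps=0.3, min_samples=2, metric='cosine', algorithm='brute').fit(X)
print(db.labels_)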
Boozy answered 23/9, 2015 at 17:15 Comment(3)
Thanks! Now it works. At first it gave me an error because I used np.float32 instead of np.double for my dataset. I suppose that DBSCAN requires such precision for the cosine metric, since the latter has a small range (between 0 and 1). – Albertson
That should not be necessary in general, but the sklearn implementation may have such limitations. – Boozy
As of today (October 2019) the 'brute' algorithm does not work, but the 'generic' one does. As noted before, the .fit method needs double precision. – Burgomaster

If you want a normalized distance like the cosine distance, you can also normalize your vectors first and then use the euclidean metric. Note that for two normalized vectors u and v, the euclidean distance is equal to sqrt(2 - 2*cos(u, v)) (see this discussion).

You can hence do something like:

import numpy as np
from sklearn.cluster import DBSCAN

Xnorm = np.linalg.norm(X, axis=1)        # row norms (X assumed dense here)
Xnormed = X / Xnorm.reshape(-1, 1)       # normalize each row to unit length
db = DBSCAN(eps=0.5, min_samples=2, metric='euclidean').fit(Xnormed)

The distances will lie in [0, 2], so make sure you adjust eps accordingly.
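
A quick numerical sanity check of that relationship (my own addition, assuming dense numpy arrays): for unit vectors, euclidean = sqrt(2 - 2*cos) = sqrt(2 * cosine_distance), so an eps of x in cosine distance corresponds to sqrt(2*x) in euclidean distance after normalization.

import numpy as np

rng = np.random.default_rng(0)
u = rng.random(5); v = rng.random(5)
u /= np.linalg.norm(u); v /= np.linalg.norm(v)   # normalize to unit length

cos_dist = 1.0 - u @ v                 # cosine distance of unit vectors
euclid = np.linalg.norm(u - v)         # euclidean distance

# euclidean distance equals sqrt(2 * cosine_distance) for unit vectors
print(np.isclose(euclid, np.sqrt(2.0 * cos_dist)))   # True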

Bobbinet answered 1/11, 2016 at 21:24 Comment(2)
Could you expand a little more on why DBSCAN with euclidean-distance-on-normalised-vectors would yield the same result as with straightforward cosine distance, if that is the case? In particular, what's with the squaring/square root, and does it matter that cosine really measures similarity and not distance (the distance is 1 - cos(.;.))? – Subclavius
For instance, if you know that eps should be set to x with cosine distance, then it should be set to sqrt(2x) when using DBSCAN with euclidean distance. And, if such is the data, does the sklearn indexing still accomplish its speed-up purpose all right? – Subclavius
