Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
Asked Answered
B

11

234

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Business answered 3/4, 2011 at 12:39 Comment(11)
Note that k-means is designed for Euclidean distance. It may stop converging with other distances, when the mean is no longer a best estimation for the cluster "center".Corrosive
Why does k-means work only with Euclidean distance?Hakon
@Anony-Mousse It is incorrect to say that k-means is only designed for Euclidean distance. It can be modified to work with any valid distance metric defined on the observation space. For example, take a look at the article on k-medoids.Cog
PAM (aka k-medoids) is a very different algorithm. It's related to k-means but much more expensive.Corrosive
@curious: the mean minimizes squared differences (= squared Euclidean distance). If you want a different distance function, you need to replace the mean with an appropriate center estimation. K-medoids is such an algorithm, but finding the medoid is much more expensive.Corrosive
Somewhat relevant here: there is currently an open pull request implementing Kernel K-Means. When it's finished you'll be able to specify your own kernel for the computation.Quarrelsome
@ely. "It is incorrect to say that k-means is only designed for Euclidean distance." No, it is not incorrect, IMHO. K-means and K-medoids may be related, but they are different algorithms with different underlying mathematical models, and hence different conditions for convergence. K-means assumes Euclidean distance. K-medoids assumes Manhattan distance. Please correct me if I am wrong.Undertint
@ChirazBenAbdelkader They are the same algorithm with specifically the same underlying model. They only differ in the specific calculation of the exemplar that's used (whether it is a group centroid or an actual group medoid). K-means refers to a family of algorithms that all use the same underlying model just with different notions of distance or different notions of exemplar.Cog
@ely. I agree with you partially. Maybe I'm splitting hairs. But it really depends on what you consider "same" model. Yes, Kmeans and Kmedoids are based on the same generic model. But they are sufficiently different and are certainly NOT interchangeable in practice.Undertint
@ChirazBenAbdelkader Many general algorithms come with specific variations. For example, the "algorithm" of SVM would be strictly different, in a pedantic sense, if you use RBF kernel vs. polynomial kernel, etc., and the two things would certainly not be easily interchangeable in practice. But it would be silly to say that SVM with RBF kernel is "completely different" than with polynomial kernel. Clearly it's the same algorithm, but a subset of the algorithm can be interchanged as a hyperparameter. It's the same with k-means algorithms.Cog
For example, consider just using different dissimilarity kernels for a given set of data points, like from scipy's pdist. Or consider minimizing the L1 norm or KL divergence if given data points that have a sparsity constraint or which are probability distributions respectively. There's no reason why you can't run the k-means algorithm on these types of data, and just use an appropriate different "distance between points" when minimizing the loss function with respect to candidate centers.Cog
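To make that concrete, here is a small sketch of computing a few such dissimilarities with scipy; the toy arrays and the lambda used for KL divergence are made up purely for illustration:

import numpy as np
from scipy.spatial.distance import pdist, cdist, squareform
from scipy.special import rel_entr

X = np.abs(np.random.default_rng(0).normal(size=(6, 4)))
P = X / X.sum(axis=1, keepdims=True)                 # rows as probability distributions

D_l1  = squareform(pdist(X, metric="cityblock"))     # L1 / Manhattan
D_cos = squareform(pdist(X, metric="cosine"))        # cosine dissimilarity
D_kl  = cdist(P, P, metric=lambda p, q: rel_entr(p, q).sum())  # KL divergence (asymmetric)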
J
89

Here's a small kmeans that uses any of the 20-odd distances in scipy.spatial.distance, or a user function.
Comments would be welcome (this has had only one user so far, not enough); in particular, what are your N, dim, k, metric ?

#!/usr/bin/env python
# kmeans.py using any of the 20-odd metrics in scipy.spatial.distance
# kmeanssample 2 pass, first sample sqrt(N)

from __future__ import division
import random
import numpy as np
from scipy.spatial.distance import cdist  # $scipy/spatial/distance.py
    # http://docs.scipy.org/doc/scipy/reference/spatial.html
from scipy.sparse import issparse  # $scipy/sparse/csr.py

__date__ = "2011-11-17 Nov denis"
    # X sparse, any cdist metric: real app ?
    # centres get dense rapidly, metrics in high dim hit distance whiteout
    # vs unsupervised / semi-supervised svm

#...............................................................................
def kmeans( X, centres, delta=.001, maxiter=10, metric="euclidean", p=2, verbose=1 ):
    """ centres, Xtocentre, distances = kmeans( X, initial centres ... )
    in:
        X N x dim  may be sparse
        centres k x dim: initial centres, e.g. random.sample( X, k )
        delta: relative error, iterate until the average distance to centres
            is within delta of the previous average distance
        maxiter
        metric: any of the 20-odd in scipy.spatial.distance
            "chebyshev" = max, "cityblock" = L1, "minkowski" with p=
            or a function( Xvec, centrevec ), e.g. Lqmetric below
        p: for minkowski metric -- local mod cdist for 0 < p < 1 too
        verbose: 0 silent, 2 prints running distances
    out:
        centres, k x dim
        Xtocentre: each X -> its nearest centre, ints N -> k
        distances, N
    see also: kmeanssample below, class Kmeans below.
    """
    if not issparse(X):
        X = np.asanyarray(X)  # ?
    centres = centres.todense() if issparse(centres) \
        else centres.copy()
    N, dim = X.shape
    k, cdim = centres.shape
    if dim != cdim:
        raise ValueError( "kmeans: X %s and centres %s must have the same number of columns" % (
            X.shape, centres.shape ))
    if verbose:
        print "kmeans: X %s  centres %s  delta=%.2g  maxiter=%d  metric=%s" % (
            X.shape, centres.shape, delta, maxiter, metric)
    allx = np.arange(N)
    prevdist = 0
    for jiter in range( 1, maxiter+1 ):
        D = cdist_sparse( X, centres, metric=metric, p=p )  # |X| x |centres|
        xtoc = D.argmin(axis=1)  # X -> nearest centre
        distances = D[allx,xtoc]
        avdist = distances.mean()  # median ?
        if verbose >= 2:
            print "kmeans: av |X - nearest centre| = %.4g" % avdist
        if (1 - delta) * prevdist <= avdist <= prevdist \
        or jiter == maxiter:
            break
        prevdist = avdist
        for jc in range(k):  # (1 pass in C)
            c = np.where( xtoc == jc )[0]
            if len(c) > 0:
                centres[jc] = X[c].mean( axis=0 )
    if verbose:
        print "kmeans: %d iterations  cluster sizes:" % jiter, np.bincount(xtoc)
    if verbose >= 2:
        r50 = np.zeros(k)
        r90 = np.zeros(k)
        for j in range(k):
            dist = distances[ xtoc == j ]
            if len(dist) > 0:
                r50[j], r90[j] = np.percentile( dist, (50, 90) )
        print "kmeans: cluster 50 % radius", r50.astype(int)
        print "kmeans: cluster 90 % radius", r90.astype(int)
            # scale L1 / dim, L2 / sqrt(dim) ?
    return centres, xtoc, distances

#...............................................................................
def kmeanssample( X, k, nsample=0, **kwargs ):
    """ 2-pass kmeans, fast for large N:
        1) kmeans a random sample of nsample ~ sqrt(N) from X
        2) full kmeans, starting from those centres
    """
        # merge w kmeans ? mttiw
        # v large N: sample N^1/2, N^1/2 of that
        # seed like sklearn ?
    N, dim = X.shape
    if nsample == 0:
        nsample = max( 2*np.sqrt(N), 10*k )
    Xsample = randomsample( X, int(nsample) )
    pass1centres = randomsample( X, int(k) )
    samplecentres = kmeans( Xsample, pass1centres, **kwargs )[0]
    return kmeans( X, samplecentres, **kwargs )

def cdist_sparse( X, Y, **kwargs ):
    """ -> |X| x |Y| cdist array, any cdist metric
        X or Y may be sparse -- best csr
    """
        # todense row at a time, v slow if both v sparse
    sxy = 2*issparse(X) + issparse(Y)
    if sxy == 0:
        return cdist( X, Y, **kwargs )
    d = np.empty( (X.shape[0], Y.shape[0]), np.float64 )
    if sxy == 2:
        for j, x in enumerate(X):
            d[j] = cdist( x.todense(), Y, **kwargs ) [0]
    elif sxy == 1:
        for k, y in enumerate(Y):
            d[:,k] = cdist( X, y.todense(), **kwargs ) [0]
    else:
        for j, x in enumerate(X):
            for k, y in enumerate(Y):
                d[j,k] = cdist( x.todense(), y.todense(), **kwargs ) [0]
    return d

def randomsample( X, n ):
    """ random.sample of the rows of X
        X may be sparse -- best csr
    """
    sampleix = random.sample( xrange( X.shape[0] ), int(n) )
    return X[sampleix]

def nearestcentres( X, centres, metric="euclidean", p=2 ):
    """ each X -> nearest centre, any metric
            euclidean2 (~ withinss) is more sensitive to outliers,
            cityblock (manhattan, L1) less sensitive
    """
    D = cdist( X, centres, metric=metric, p=p )  # |X| x |centres|
    return D.argmin(axis=1)

def Lqmetric( x, y=None, q=.5 ):
    # yes a metric, may increase weight of near matches; see ...
    return (np.abs(x - y) ** q) .mean() if y is not None \
        else (np.abs(x) ** q) .mean()

#...............................................................................
class Kmeans:
    """ km = Kmeans( X, k= or centres=, ... )
        in: either initial centres= for kmeans
            or k= [nsample=] for kmeanssample
        out: km.centres, km.Xtocentre, km.distances
        iterator:
            for jcentre, J in km:
                clustercentre = centres[jcentre]
                J indexes e.g. X[J], classes[J]
    """
    def __init__( self, X, k=0, centres=None, nsample=0, **kwargs ):
        self.X = X
        if centres is None:
            self.centres, self.Xtocentre, self.distances = kmeanssample(
                X, k=k, nsample=nsample, **kwargs )
        else:
            self.centres, self.Xtocentre, self.distances = kmeans(
                X, centres, **kwargs )

    def __iter__(self):
        for jc in range(len(self.centres)):
            yield jc, (self.Xtocentre == jc)

#...............................................................................
if __name__ == "__main__":
    import random
    import sys
    from time import time

    N = 10000
    dim = 10
    ncluster = 10
    kmsample = 100  # 0: random centres, > 0: kmeanssample
    kmdelta = .001
    kmiter = 10
    metric = "cityblock"  # "chebyshev" = max, "cityblock" L1,  Lqmetric
    seed = 1

    exec( "\n".join( sys.argv[1:] ))  # run this.py N= ...
    np.set_printoptions( 1, threshold=200, edgeitems=5, suppress=True )
    np.random.seed(seed)
    random.seed(seed)

    print "N %d  dim %d  ncluster %d  kmsample %d  metric %s" % (
        N, dim, ncluster, kmsample, metric)
    X = np.random.exponential( size=(N,dim) )
        # cf scikits-learn datasets/
    t0 = time()
    if kmsample > 0:
        centres, xtoc, dist = kmeanssample( X, ncluster, nsample=kmsample,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    else:
        randomcentres = randomsample( X, ncluster )
        centres, xtoc, dist = kmeans( X, randomcentres,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    print "%.0f msec" % ((time() - t0) * 1000)

    # also ~/py/np/kmeans/test-kmeans.py

Some notes added 26mar 2012:

1) for cosine distance, first normalize all the data vectors to |X| = 1; then

cosinedistance( X, Y ) = 1 - X . Y = |X - Y|^2 / 2

is fast, since it is just half the squared Euclidean distance (see the sketch after these notes). For bit vectors, keep the norms separately from the vectors instead of expanding out to floats (although some programs may expand for you). For sparse vectors, say 1 % of N nonzero, X . Y should take time O( 2 % of N ), space O(N); but I don't know which programs do that.

2) Scikit-learn clustering gives an excellent overview of k-means, mini-batch-k-means ... with code that works on scipy.sparse matrices.

3) Always check cluster sizes after k-means. If you're expecting roughly equal-sized clusters, but they come out [44 37 9 5 5] % ... (sound of head-scratching).
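A quick sketch checking the cosine-distance identity from note 1 on unit-normalized rows (the random data here is only for illustration):

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Y = rng.normal(size=(7, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize every row to |x| = 1
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

cosdist = cdist(X, Y, metric="cosine")          # 1 - x . y for unit vectors
halfsqeuc = cdist(X, Y, metric="sqeuclidean") / 2
assert np.allclose(cosdist, halfsqeuc)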

Jetliner answered 5/4, 2011 at 12:5 Comment(18)
+1 First of all, thank you for sharing your implementation. I just wanted to confirm that the algorithm works great for my dataset of 900 vectors in a 700 dimensional space. I was just wondering if it is also possible to evaluate the quality of the clusters generated. Can any of the values in your code be reused to compute the cluster quality, to aid in selecting the optimal number of clusters?Telethon
Legend, you're welcome. (Updated the code to print cluster 50 % / 90 % radius). "Cluster quality" is a largish topic: how many clusters do you have, do you have training samples with known clusters, e.g. from experts ? On number of clusters, see SO how-do-i-determine-k-when-using-k-means-clusteringJetliner
Thank you once again. Actually, I do not have the training samples but am trying to verify the clusters manually after classification (trying to play the role of the domain expert as well). I am performing a document-level classification after applying SVD to some original documents and reducing their dimension. The results seem good but I wasn't sure on how to validate them. For the initial stage, while exploring various cluster validity metrics, I came across Dunn's Index, Elbow method etc. wasn't really sure which one to utilize so thought I will start off with the Elbow method.Telethon
And of course, I was also looking at computing the silhouette width to determine the cluster quality but at first look, it seemed quite expensive though I am not sure if I missed something obvious.Telethon
Not sure if you faced this but I was trying to use my other example (#6646395) using this code to divide the 10 points into three clusters and it throws an error: ValueError: operands could not be broadcast together with shapes (0) (2) Just thought I will let you know about it as it has something to do with your latest addition.Telethon
What does kmeans( verbose=1 ) say ? (kmeanssample makes sense only for large N; nsample = max( 2*np.sqrt(N), 10*k ) may or may not be reasonable.)Jetliner
Oh I see... verbose = 1 operates correctly without any errors in my case.Telethon
I observed something else. When plotting a distortion curve for instance, we need to plot the distortion for all values of k if the slope of the curve is not so high to observe what k to pick. If that is the case (or perhaps the data is not so closely packed after all), we might have to set k ~ N. The nsample = max( 2*np.sqrt(N), 10*k ) line in your code will not let me pick large k's. I have temporarily worked around this by manually setting the sample size but would you have any other suggestions?Telethon
What are your N, dim and k please ? You can call kmeanssample( ... nsample = anything < N ), but it makes no sense to split N points into k ~ N clusters ??Jetliner
Yes that is correct. In this case, N=700, dim = 350. I am varying k to get a distortion or cluster quality plot. It looks like in high-dimensional data, it becomes tricky to apply clustering so I was looking to see how the distortion looks like. By the way, that k~N was meant to say k close to N. Sorry about the notation.Telethon
Thanks for the code Denis. It works fine for dense matrices, however it fails on a sparse mat (lil_matrix). I'm getting: kmeans: X (893, 25854) centres (9, 25854) delta=0.001 maxiter=100 metric=cosine Traceback (most recent call last): File "<stdin>", line 1, in <module> File "tweetsClustering.py", line 424, in computeClusters metric="cosine", p=2, verbose=1 ) File "kmeans.py", line 74, in kmeans centres[jc] = X[c].mean( axis=0 ) File "/usr/lib/python2.7/dist-packages/scipy/sparse/lil.py", line 201, in getitem i, j = index ValueError: too many values to unpackAlvey
@ScienceFriction, sorry about that, will look at it. Is Scikit-learn clustering an option for you ? (also added some notes at the end above).Jetliner
I know this is un-earthing something really old, but I just started with using kmeans and stumbled upon this. For future readers tempted to use this code : check out @Anony-Mousse comments on the question above first ! This implementation, as far as I can see, is making the wrong assumption that you can somehow still use the "mean of points in a cluster" to determine the centroid of that cluster. This makes no sense for anything else than Euclidean distance (except in very specific cases on the unit sphere, etc...). Again Anony-Mousse's comments on the question is right on the nose.Hagiology
@Nevoris, yes I agree, except for cosine distance: see here for why, also why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metricJetliner
Given that this version is no longer implemented in C++ or Fortran, does that mean it is far slower?Langan
@Seymour, slower what -- time to get running, lifetime CPU time, vs. clear and adaptable ? See e.g. python-with-numpy-scipy-vs-pure-c-for-big-data-analysis and hundreds of similar questions.Jetliner
Cool! So this is a good point for python, right? Because if I write my own kmeans in R, it is incredibly slower than the one implemented in fortran!Langan
@Seymour, no: for what problems, size, sparsity, clarity ? Expect HUGE differences more related to test cases / goals than language. See the 100 Q+As under stats.stackexchange.com/questions/tagged/k-means+rJetliner
S
66

Unfortunately no: scikit-learn's current implementation of k-means only uses Euclidean distances.

It is not trivial to extend k-means to other distances, and Denis' answer above is not the correct way to implement k-means for other metrics.

Subrogation answered 3/4, 2011 at 17:17 Comment(3)
Why is the implementation given by Denis incorrect?Shortterm
For instance, for the Manhattan distance (Minkowski with p=1), you need a dedicated algorithm (K-Medoids en.wikipedia.org/wiki/K-medoids) which is quite different from K-Means internally.Subrogation
@Subrogation This is simply wrong. k-means for Manhattan distance is ... k-medians, i.e., centres are updated to the median (computed for every dimension separately) of a cluster.Nunatak
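To illustrate the comment above, here is a minimal k-medians sketch, assuming scipy's cityblock metric for the assignment step and the per-dimension median for the centre update; the toy data and iteration count are arbitrary choices for the example:

import numpy as np
from scipy.spatial.distance import cdist

def kmedians(X, k, n_iter=20, seed=0):
    """L1 (Manhattan) k-means variant: the centre of a cluster is its per-dimension median."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = cdist(X, centres, metric="cityblock").argmin(axis=1)  # assignment step, L1 distance
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centres[j] = np.median(members, axis=0)  # update step: median, not mean
    return centres, labels

X = np.random.default_rng(1).exponential(size=(200, 5))
centres, labels = kmedians(X, k=3)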
V
35

Just use nltk instead where you can do this, e.g.

from nltk.cluster.kmeans import KMeansClusterer
from nltk.cluster.util import cosine_distance  # import needed for the distance argument below

NUM_CLUSTERS = <choose a value>
data = <sparse matrix that you would normally give to scikit>.toarray()

kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(data, assign_clusters=True)
Velazquez answered 12/9, 2016 at 1:38 Comment(4)
How efficient is this implementation ? It seems to take forever to cluster as little as 5k points (in dimension 100).Turnsole
In dimension 100, clustering 1k points takes 1 second per run (repeats), 1.5k points take 2 minutes, and 2k takes... too long.Turnsole
Indeed; as per @Anony-Mousse comment below, it seems cosine distance may have convergence issues. To me, this is really a case of garbage-in-garbage-out: you could use whatever distance function you want, but if that function violates the assumptions of the algorithm, don't expect it to produce meaningful results!Undertint
FYI, nltk/3.7's KMeansClusterer does NOT work with the hamming distance because when calculating the new centroids (nltk/cluster/kmeans.py : 186 in KMeansClusterer._centroid(), it uses floating point arithmetic. It would require some modifications to adapt to using a hamming distance.Molnar
T
17

Yes, you can use a different metric function; however, by definition, the k-means clustering algorithm relies on the Euclidean distance from the mean of each cluster.

You could use a different metric for the assignment step, so even though you are still calculating the mean, you could use something like the Mahalanobis distance.
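As a hedged sketch of that idea (not code from this answer): a Lloyd-style loop that assigns points by scipy's Mahalanobis distance while still updating centres with the arithmetic mean. The toy data, the fixed inverse covariance VI, and the iteration count are assumptions, and convergence is not guaranteed in general:

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
k, n_iter = 3, 15
VI = np.linalg.inv(np.cov(X.T))          # inverse covariance for the Mahalanobis metric
centres = X[rng.choice(len(X), k, replace=False)]

for _ in range(n_iter):
    D = cdist(X, centres, metric="mahalanobis", VI=VI)   # Mahalanobis assignment
    labels = D.argmin(axis=1)
    for j in range(k):
        members = X[labels == j]
        if len(members):
            centres[j] = members.mean(axis=0)            # centre update is still the mean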

Tempera answered 26/3, 2012 at 15:52 Comment(6)
+1: Let me emphasize this taking the mean is only appropriate for certain distance functions, such as the Euclidean distance. For other distance functions, you'd need to replace the cluster-center estimation function, too!Corrosive
@Anony-Mousse. What am i supposed to change when i use the cosine distance for instance?Hakon
I don't know. I haven't seen a proof for convergence with cosine. I believe it will converge if your data is non-negative and normalized to the unit sphere, because then it's essentially k-means in a different vector space.Corrosive
I agree with @Anony-Mousse. To me, this is just a case of garbage-in-garbage-out: you could run K-means with whatever distance function you want, but if that function violates the underlying assumptions of the algorithm, don't expect it to produce meaningful results!Undertint
@Anony-Mousse but how to implement K-means using the Mahalanobis distance?Employee
@Cecila since using the arithmetic mean supposedly also minimizes Mahalanobis distances, this case is straightforward (I don't have a proof though). Although I'd clearly choose GMM in such cases rather than k-means.Corrosive
R
13

There is pyclustering, which is Python/C++ (so it's fast!) and lets you specify a custom metric function:

from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import type_metric, distance_metric

# example 2-D data so the snippet runs as-is -- replace with your own points
sample = [[4.9, 6.2], [5.1, 6.1], [4.8, 5.8], [5.8, 6.8], [6.1, 7.0], [5.9, 6.6]]

user_function = lambda point1, point2: point1[0] + point2[0] + 2
metric = distance_metric(type_metric.USER_DEFINED, func=user_function)

# create K-Means algorithm with a custom distance metric
start_centers = [[4.7, 5.9], [5.7, 6.5]]
kmeans_instance = kmeans(sample, start_centers, metric=metric)

# run cluster analysis and obtain results
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()

Actually, I haven't tested this code but cobbled it together from a ticket and example code.

Rupp answered 7/8, 2018 at 13:20 Comment(1)
needs Matplotlib installed which needs "Python as a framework on Mac OS X" :(Rupp
P
5

The k-means implementation in Spectral Python allows the use of the L1 (Manhattan) distance.

Pachyderm answered 31/3, 2013 at 0:14 Comment(0)
G
4

sklearn's KMeans uses the Euclidean distance; it has no metric parameter. That said, if you're clustering time series, you can use the tslearn Python package, where you can specify a metric (dtw, softdtw, euclidean).
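A minimal tslearn sketch, assuming random-walk series as stand-in data (not part of the original answer):

import numpy as np
from tslearn.clustering import TimeSeriesKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 40, 1)).cumsum(axis=1)   # 50 random-walk time series of length 40

model = TimeSeriesKMeans(n_clusters=3, metric="dtw", random_state=0)
labels = model.fit_predict(X)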

Goodlooking answered 27/5, 2019 at 10:45 Comment(0)
L
1

The Affinity Propagation algorithm from the sklearn library allows you to pass a similarity matrix instead of the samples. So you can use your own metric to compute the similarity matrix (not the dissimilarity matrix) and pass it to the function by setting the "affinity" parameter to "precomputed": https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation.fit As for K-Means, I think it is also possible, but I have not tried it. However, as the other answers stated, finding the mean under a different metric will be the issue. Instead, you can use the PAM (K-Medoids) algorithm, as it calculates the change in Total Deviation (TD) and thus does not rely on the distance metric: https://python-kmedoids.readthedocs.io/en/latest/#fasterpam
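A minimal sketch of the precomputed-affinity route; the toy data and the choice of negated Manhattan distances as the similarity are assumptions for illustration:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# similarity = negated distance under your own metric (higher means more similar)
S = -squareform(pdist(X, metric="cityblock"))

ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
labels = ap.labels_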

Lengthen answered 14/12, 2022 at 7:8 Comment(0)
S
0

Yes, in the current stable version of sklearn (scikit-learn 1.1.3), you can easily use your own distance metric. All you have to do is create a class that inherits from sklearn.cluster.KMeans and overrides its _transform method.

The example below is for the IoU distance from the YOLOv2 paper.

import sklearn.cluster
import numpy as np

def anchor_iou(box_dims, centroid_box_dims):
    box_w, box_h = box_dims[..., 0], box_dims[..., 1]
    centroid_w, centroid_h = centroid_box_dims[..., 0], centroid_box_dims[..., 1]
    inter_w = np.minimum(box_w[..., np.newaxis], centroid_w[np.newaxis, ...])
    inter_h = np.minimum(box_h[..., np.newaxis], centroid_h[np.newaxis, ...])
    inter_area = inter_w * inter_h
    centroid_area = centroid_w * centroid_h
    box_area = box_w * box_h
    return inter_area / (
        centroid_area[np.newaxis, ...] + box_area[..., np.newaxis] - inter_area
    )

class IOUKMeans(sklearn.cluster.KMeans):
    def __init__(
        self,
        n_clusters=8,
        *,
        init="k-means++",
        n_init=10,
        max_iter=300,
        tol=1e-4,
        verbose=0,
        random_state=None,
        copy_x=True,
        algorithm="lloyd",
    ):
        super().__init__(
            n_clusters=n_clusters,
            init=init,
            n_init=n_init,
            max_iter=max_iter,
            tol=tol,
            verbose=verbose,
            random_state=random_state,
            copy_x=copy_x,
            algorithm=algorithm
        )

    def _transform(self, X):
        return anchor_iou(X, self.cluster_centers_)

rng = np.random.default_rng(12345)
num_boxes = 10
bboxes = rng.integers(low=0, high=100, size=(num_boxes, 2))

kmeans = IOUKMeans(num_clusters).fit(bboxes)

Skillern answered 28/10, 2022 at 8:38 Comment(1)
I think you meant to write "kmeans = IOUKMeans(num_boxes).fit(bboxes)"? Also, could you kindly provide the jaccard function instead of the anchor_iou, I think it will be a lot more helpful to n00bs like myself :)Katekatee
V
0

As of version scikit-learn==1.2.2, one could replace _euclidean_distances in sklearn.cluster._kmeans with the following:

import sklearn.cluster._kmeans as kmeans
from sklearn.metrics import pairwise_distances

def custom_distances(X, Y=None, Y_norm_squared=None, squared=False):
    if squared: #squared equals False during cluster center estimation
        return pairwise_distances(X,Y, metric='minkowski', p=1.5)
    else:
        return pairwise_distances(X,Y, metric='minkowski', p=1.5)
    
kmeans._euclidean_distances = custom_distances
kmeans.euclidean_distances = custom_distances # utilized by the method `KMeans._transform`

Then create the estimator as usual:

km = kmeans.KMeans(init="k-means++", n_clusters=clusters, n_init=4, random_state=0)
Vraisemblance answered 3/1 at 10:58 Comment(0)
A
-2
def distance_metrics(dist_metrics):
    kmeans_instance = kmeans(trs_data, initial_centers, metric=dist_metrics)

    label = np.zeros(210, dtype=int)
    for i in range(0, len(clusters)):
        for index, j in enumerate(clusters[i]):
            label[j] = i
Antitrust answered 2/9, 2020 at 14:44 Comment(2)
Please consider adding a few words of explanation as well.Trehala
This doesn't even match the signature of sklearn.cluster.KMeans or sklearn.cluster.k_means?Brainwork
