Grid search for hyperparameter evaluation of clustering in scikit-learn

I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score which works fine.

My problem here is that I don't need to use the cross-validation aspect of the GridSearchCV/RandomizedSearchCV, but I can't find a simple GridSearch/RandomizedSearch. I can write my own but the ParameterSampler and ParameterGrid objects are very useful.

My next step will be to subclass BaseSearchCV and implement my own _fit() method, but I thought it was worth asking whether there is a simpler way to do this, for example by passing something to the cv parameter?

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

def silhouette_score(estimator, X):
    # distance_matrix is the precomputed pairwise distance matrix for the sample
    clusters = estimator.fit_predict(X)
    score = metrics.silhouette_score(distance_matrix, clusters, metric='precomputed')
    return score

ca = KMeans()
param_grid = {"n_clusters": range(2, 11)}

# run the grid search
search = GridSearchCV(
    ca,
    param_grid=param_grid,
    scoring=silhouette_score,
    cv= # can I pass something here to only use a single fold?
    )
search.fit(distance_matrix)
Rookie answered 5/1, 2016 at 11:49 Comment(9)
You don't do cross-validation (or grid-search) in unsupervised data mining. Just compute the 10 runs of k-means, and use the best.Minaminabe
Obviously you don't do cross-validation, but why not do grid search given an appropriate scoring metric such as silhouette score?Rookie
Also, kmeans is just an example here. I'd like to test a number of different algorithms and their hyperparameters.Rookie
You might as well optimize silhouette directly then. Don't expect the clustering result to really improve this way. In the end, you just look at which parameters agree best with Silhouette. It's just another criterion than SSE.Minaminabe
What would I use to do that without using one of the BaseSearchCV subclasses? Have I missed some feature for optimising hyperparameters, or do you mean write something specific for each algorithm?Rookie
I'm suggesting to directly search for the optimum silhouette solution, without using any clustering method. Naive enumeration won't work, but say evolutionary optimization or something like this may work. k-means does not optimize the silhouette, but that doesn't say there isn't an algorithm which does.Minaminabe
Ah, I see. I may want to add extra things to the scoring method though (preferred size of clusters, similarity of clusters size, etc) so I'm really looking for a way of doing something a lot like grid search. Thanks for the suggestions though.Rookie
Please see if this answers your question.Alejandraalejandrina
Hey @JamieBull, can I reach out to you?Lochia

The clusteval library will help you to evaluate the data and find the optimal number of clusters. This library contains five methods that can be used to evaluate clusterings: silhouette, dbindex, derivative, dbscan and hdbscan.

pip install clusteval

Which evaluation method to use depends on your data.

# Import library
from clusteval import clusteval

# Set parameters, as an example dbscan
ce = clusteval(method='dbscan')

# Fit to find optimal number of clusters using dbscan
results = ce.fit(X)

# Make plot of the cluster evaluation
ce.plot()

# Make scatter plot. Note that the first two coordinates are used for plotting.
ce.scatter(X)

# results is a dict with various output statistics. One of them is the cluster labels.
cluster_labels = results['labx']
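
The example above uses dbscan; assuming the other listed methods follow the same constructor pattern (an assumption on my part, not something this answer confirms), silhouette-based evaluation would just swap the method string:

# Assumed usage, following the pattern shown above
from clusteval import clusteval

ce = clusteval(method='silhouette')   # 'silhouette' is one of the five listed methods
results = ce.fit(X)                   # X: your data matrix, as above
ce.plot()                             # inspect the evaluation across numbers of clusters
cluster_labels = results['labx']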
Tutorial answered 18/6, 2020 at 20:20 Comment(1)
this is very cool - any idea how to fit this into a pipeline to optimise earlier stages, such as TFIDF etc?Gunyah

OK, this might be an old question, but I use this kind of code:

First, we want to generate all the possible combinations of parameters:

def make_generator(parameters):
    # Recursively yield every combination of the given parameter values,
    # one dict per combination (similar to sklearn's ParameterGrid).
    if not parameters:
        yield dict()
    else:
        key_to_iterate = list(parameters.keys())[0]
        next_round_parameters = {p: parameters[p]
                                 for p in parameters if p != key_to_iterate}
        for val in parameters[key_to_iterate]:
            for pars in make_generator(next_round_parameters):
                temp_res = pars
                temp_res[key_to_iterate] = val
                yield temp_res

Then create a loop out of this:

# add fixed parameters - here it's just an arbitrary one
fixed_params = {"max_iter": 300}

param_grid = {"n_clusters": range(2, 11)}

for params in make_generator(param_grid):
    params.update(fixed_params)
    ca = KMeans(**params)
    ca.fit(_data)
    labels = ca.labels_
    # Evaluate the clustering labels here and
    # decide whether to keep or discard this parameter set!

Of course, this can be wrapped up in a neater function; this solution is mostly an example.
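
For reference, sklearn's ParameterGrid (already mentioned in the question) generates the same combinations, so an equivalent loop could look roughly like this sketch; the random _data array here is just a stand-in for your own data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import ParameterGrid

_data = np.random.rand(100, 5)   # stand-in for your own data matrix

fixed_params = {"max_iter": 300}
param_grid = {"n_clusters": list(range(2, 11))}

for params in ParameterGrid(param_grid):
    # ParameterGrid yields one dict per combination, like make_generator above
    params.update(fixed_params)
    ca = KMeans(**params)
    ca.fit(_data)
    labels = ca.labels_
    # evaluate the labels and keep the best parameter set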

Hope it helps someone!

Bipinnate answered 13/3, 2019 at 21:22 Comment(0)

Recently I ran into a similar problem. I defined a custom iterable cv_custom which defines the splitting strategy and is passed as the cross-validation parameter cv. This iterable should contain one pair per fold, with samples identified by their indices, e.g. ([fold1_train_ids], [fold1_test_ids]), ([fold2_train_ids], [fold2_test_ids]), ... In our case, we need just one pair for a single fold, with the indices of all examples in both the train and the test part: ([train_ids], [test_ids])

from sklearn.model_selection import cross_val_score

N = len(distance_matrix)
cv_custom = [(range(0, N), range(0, N))]           # a single fold: train and test are all samples
scores = cross_val_score(clf, X, y, cv=cv_custom)  # clf, X, y: your estimator, data and labels
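
Applied to the question's setup, the same single-fold trick could look roughly like the sketch below. The random data is only a stand-in for the question's ~100 records, the scorer is the question's silhouette scorer (renamed to avoid shadowing sklearn's function), and fitting KMeans directly on a precomputed distance matrix is simply carried over from the question:

import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
from sklearn.model_selection import GridSearchCV

X_features = np.random.rand(100, 5)               # stand-in for the ~100 records
distance_matrix = pairwise_distances(X_features)  # precomputed distances, as in the question

def silhouette_scorer(estimator, X):
    # X is the "test" part of the split, i.e. the full distance matrix here
    clusters = estimator.fit_predict(X)
    return metrics.silhouette_score(X, clusters, metric='precomputed')

N = len(distance_matrix)
single_fold = [(np.arange(N), np.arange(N))]      # train and test are both "all samples"

search = GridSearchCV(
    KMeans(),
    param_grid={"n_clusters": range(2, 11)},
    scoring=silhouette_scorer,
    cv=single_fold,
)
search.fit(distance_matrix)
print(search.best_params_)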
Fluff answered 24/1, 2017 at 21:24 Comment(0)
