How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

Asked 2/9, 2014 at 22:27 Answered 24/6, 2023 at 7:19

I'm trying to cluster some text documents using scikit-learn. I'm trying out both DBSCAN and MeanShift and want to determine which hyperparameters (e.g. bandwidth for MeanShift and eps for DBSCAN) work best for the kind of data I'm using (news articles).

I have some testing data which consists of pre-labeled clusters. I have been trying to use scikit-learn's GridSearchCV, but I don't understand how (or whether) it can be applied in this case, since it needs the data to be split into train and test sets, whereas I want to run the evaluation on the entire dataset and compare the results to the pre-labeled data.

I have been trying to specify a scoring function which compares the estimator's labels to the true labels, but of course it doesn't work because only a sample of the data has been clustered, not all of it.

What's an appropriate approach here?

Huckster answered 2/9, 2014 at 22:27 Comment(2)

what did you end up doing in the end? – Dolomite 17/7, 2020 at 15:53

scikit-learn provides ParameterGrid in sklearn.model_selection, which should help you loop over the grid of hyperparameters. – Szymanski 5/3, 2021 at 13:13

The following function for DBSCAN might help. I've written it to iterate over the hyperparameters eps and min_samples and included optional arguments for min and max clusters. As DBSCAN is unsupervised, I have not included an evaluation parameter.

from collections import Counter

from sklearn.cluster import DBSCAN


def dbscan_grid_search(X_data, lst, clst_count, eps_space=(0.5,),
                       min_samples_space=(5,), min_clust=0, max_clust=10):
    """
    Performs a hyperparameter grid search for DBSCAN.

    Parameters:
        * X_data            = data used to fit the DBSCAN instance
        * lst               = a list to store the results of the grid search
        * clst_count        = a list to store the number of points in each cluster (noise is labeled -1)
        * eps_space         = the values to try for the eps parameter
        * min_samples_space = the values to try for the min_samples parameter
        * min_clust         = the minimum number of clusters required after each search iteration in order for a result to be appended to lst
        * max_clust         = the maximum number of clusters allowed after each search iteration in order for a result to be appended to lst

    Example:

    # Loading libraries
    import numpy as np
    from sklearn import datasets
    from sklearn.preprocessing import StandardScaler

    # Loading iris dataset
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Scaling X data
    dbscan_scaler = StandardScaler()
    dbscan_scaler.fit(X)
    dbscan_X_scaled = dbscan_scaler.transform(X)

    # Setting empty lists in global environment
    dbscan_clusters = []
    cluster_count = []

    # Inputting function parameters
    dbscan_grid_search(X_data=dbscan_X_scaled,
                       lst=dbscan_clusters,
                       clst_count=cluster_count,
                       eps_space=np.arange(0.1, 5, 0.1),
                       min_samples_space=np.arange(1, 50, 1),
                       min_clust=3,
                       max_clust=6)
    """

    # Starting a tally of total iterations
    n_iterations = 0

    # Looping over each combination of hyperparameters
    for eps_val in eps_space:
        for samples_val in min_samples_space:

            dbscan_grid = DBSCAN(eps=eps_val, min_samples=samples_val)

            # Fitting and getting one cluster label per sample (-1 = noise)
            clusters = dbscan_grid.fit_predict(X=X_data)

            # Counting the amount of data in each cluster
            cluster_count = Counter(clusters)

            # Saving the number of clusters, excluding the noise label -1
            n_clusters = len(set(clusters) - {-1})

            # Increasing the iteration tally with each run of the loop
            n_iterations += 1

            # Appending to lst each time the n_clusters criteria are met
            if min_clust <= n_clusters <= max_clust:
                lst.append([eps_val, samples_val, n_clusters])
                clst_count.append(cluster_count)

    # Printing grid search summary information
    print(f"Search Complete. \nYour list is now of length {len(lst)}. ")
    print(f"Hyperparameter combinations checked: {n_iterations}. \n")
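A note on counting clusters: DBSCAN labels noise points as -1, so the number of clusters is the number of distinct labels excluding -1. Here is a minimal sketch on a made-up toy array (the data and parameter values are purely illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (illustrative only): two tight groups plus one far-away point
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
              [20.0, 20.0]])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

# Distinct labels, excluding the noise label -1
n_clusters = len(set(labels) - {-1})
# Number of samples DBSCAN could not assign to any cluster
n_noise = int(np.sum(labels == -1))

print(labels, n_clusters, n_noise)
```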

Oversleep answered 11/2, 2019 at 21:5 Comment(0)

Have you considered implementing the search yourself?

It's not particularly hard to implement a for loop. Even if you want to optimize two parameters it's still fairly easy.
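As a rough sketch of what such a loop could look like: the version below uses ParameterGrid from sklearn.model_selection to enumerate the combinations, clusters the whole dataset each time, and compares against the known labels with adjusted_rand_score. The synthetic make_blobs data, the grid values and the choice of metric are my assumptions, not something from the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import ParameterGrid

# Synthetic stand-in for pre-labeled data (assumption: the real input would
# be document vectors with known cluster labels)
X, y_true = make_blobs(n_samples=150, centers=3, cluster_std=0.5,
                       random_state=0)

# Illustrative grid over two parameters
param_grid = ParameterGrid({"eps": [0.2, 0.5, 1.0],
                            "min_samples": [3, 5, 10]})

best_score, best_params = -np.inf, None
for params in param_grid:
    # Cluster the entire dataset, no train/test split
    labels = DBSCAN(**params).fit_predict(X)
    # The adjusted Rand index ignores how clusters are numbered; note that
    # noise points (label -1) are simply treated as one more group here
    score = adjusted_rand_score(y_true, labels)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```

Swapping DBSCAN for MeanShift with a grid over bandwidth should work the same way, since both expose fit_predict.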

For both DBSCAN and MeanShift I do, however, advise first understanding your similarity measure. It makes more sense to choose the parameters based on an understanding of your measure than to optimize them to match some labels (which carries a high risk of overfitting).

In other words, at which distance are two articles supposed to be clustered?

If this distance varies too much from one data point to another, these algorithms will fail badly, and you may need to find a normalized distance function such that the actual similarity values are meaningful again. TF-IDF is standard on text, but mostly in a retrieval context. It may work much worse in a clustering context.
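One way to get a feel for this is to look at the distances directly before picking eps. The sketch below is only an illustration: the four-document corpus is made up, and TF-IDF with cosine distance is just one possible choice of measure.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical mini-corpus standing in for the news articles
docs = [
    "stocks fall as markets react to interest rate rise",
    "markets slide after central bank raises interest rate",
    "local team wins the football championship final",
    "football final ends with a late winning goal",
]

X = TfidfVectorizer().fit_transform(docs)
# Dense (n_docs, n_docs) matrix; for TF-IDF vectors the values lie in [0, 1]
D = cosine_distances(X)

# Distance from each document to its nearest neighbor (ignoring itself)
np.fill_diagonal(D, np.inf)
nearest = D.min(axis=1)
print(np.round(nearest, 2))
```

If the nearest-neighbor distances are spread widely, a single eps is unlikely to suit all documents, which is the failure mode described above.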

Also beware that MeanShift (like k-means) needs to recompute coordinates. On text data this may yield undesired results, where the updated coordinates actually get worse instead of better.

Fetiparous answered 3/9, 2014 at 11:5 Comment(2)

Yes, I'm in the process of implementing it myself. I was just wondering if scikit-learn supported this out-of-the-box and I was overlooking something. My plan was to run the grid search over several different pre-labeled datasets and gain insight into the potential issue you're pointing out - thank you for pointing out the risks! – Huckster 3/9, 2014 at 12:3

sklearn.cross_validation (sklearn.model_selection in current versions) has various iterators that yield splits of datasets (cross-validation, random splitting, etc.). Those should make this loop quite easy to write. – Tiebold 3/9, 2014 at 16:35

You may specify the cv parameter of GridSearchCV as "An iterable yielding (train, test) splits as arrays of indices" (quoted from the doc).

With DBSCAN specifically, there's one more problem -- there's no predict method. I use the solution from this answer.

Here is the example code.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer


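# The scorer function; make_scorer calls it as score_func(y_true, y_pred).
# Note: it counts exact label matches, so the result depends on how the
# clusters happen to be numbered.
def cmp(y_true, y_pred):
    return np.sum(y_pred == y_true)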


class DBSCANWrapper(DBSCAN):
    # Won't work if `_X` is not the same X used in `self.fit`
    def predict(self, _X, _y=None):
        return self.labels_


# Let X be your data to cluster, e.g.:
X = np.random.rand(100, 10)
# Let y_true be the groundtruth clustering result, e.g.:
y_true = np.random.randint(5, size=100)
# hyper parameters to search, e.g.:
hyperparams_dict = {'eps': np.linspace(0.1, 1.0, 10)}

# Notice here, the spec of `cv`:
cv = [(np.arange(X.shape[0]), np.arange(X.shape[0]))]

search = GridSearchCV(DBSCANWrapper(), hyperparams_dict, scoring=make_scorer(cmp), cv=cv)
search.fit(X, y_true)
print(search.best_params_)

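This also addresses the concern raised in the question ("but of course it doesn't work because only a sample of the data has been clustered, not all of it"): the cv spec above uses every index as both the train and the test set, so the entire dataset is clustered and scored.

If you instead want to fit on a training set and evaluate on a different test set (which of course wouldn't work with DBSCAN, since it cannot assign labels to unseen data), the same approach applies: simply modify the cv = ... line.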
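One caveat with the cmp scorer: cluster label numbers are arbitrary, so counting exact label matches can penalize a clustering that is correct but numbered differently. A variant worth considering, sketched here under my own assumptions (synthetic make_blobs data, illustrative grid values), passes a permutation-invariant metric such as adjusted_rand_score to make_scorer instead:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, make_scorer
from sklearn.model_selection import GridSearchCV


class DBSCANWrapper(DBSCAN):
    # Only valid when predict receives the same X that was passed to fit
    def predict(self, X, y=None):
        return self.labels_


# Synthetic stand-in for the pre-labeled data (assumption)
X, y_true = make_blobs(n_samples=120, centers=3, cluster_std=0.4,
                       random_state=0)

# A single "split" whose train and test indices both cover the whole dataset
all_idx = np.arange(X.shape[0])
cv = [(all_idx, all_idx)]

search = GridSearchCV(
    DBSCANWrapper(),
    {"eps": [0.3, 0.6, 1.0], "min_samples": [3, 5]},
    scoring=make_scorer(adjusted_rand_score),  # ignores label numbering
    cv=cv,
)
search.fit(X, y_true)
print(search.best_params_, search.best_score_)
```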

Georgeanngeorgeanna answered 24/6, 2023 at 7:19 Comment(0)
