I'm trying to cluster some text documents using scikit-learn. I'm trying out both DBSCAN and MeanShift and want to determine which hyperparameters (e.g. bandwidth for MeanShift and eps for DBSCAN) work best for the kind of data I'm using (news articles).
I have some testing data which consists of pre-labeled clusters. I have been trying to use scikit-learn's GridSearchCV but don't understand how (or whether) it can be applied in this case, since it needs the data to be split into train and test folds, whereas I want to run the evaluation on the entire dataset and compare the results to the pre-labeled data.
I have been trying to specify a scoring function which compares the estimator's labels to the true labels, but of course it doesn't work because only a sample of the data has been clustered, not all of it.
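To make the setup concrete, here is a minimal sketch of the kind of evaluation described above: fit the clusterer on the whole dataset for each candidate setting and compare the resulting labels to the known ones. Everything specific in it is an assumption for illustration only: the synthetic make_blobs data stands in for the vectorized articles, the eps and min_samples values are arbitrary, and adjusted_rand_score is just one possible way to compare predicted labels with true labels.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import ParameterGrid

# Assumption: synthetic stand-in for the vectorized articles and their known labels.
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Assumption: an arbitrary, illustrative grid of candidate hyperparameters.
param_grid = ParameterGrid({"eps": [0.1, 0.3, 0.5, 1.0], "min_samples": [5, 10]})

results = []
for params in param_grid:
    # Cluster the entire dataset (no train/test split) with this setting.
    labels = DBSCAN(**params).fit_predict(X)
    # Compare the cluster assignment with the pre-labeled clusters.
    score = adjusted_rand_score(y_true, labels)
    results.append((score, params))

# Pick the setting whose clustering agrees best with the known labels.
best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, best_score)
```

The same loop would apply to MeanShift by swapping the estimator and using a grid over bandwidth. ParameterGrid is used here only to enumerate the combinations; no cross-validation is involved.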
What's an appropriate approach here?