ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score
I am trying to calculate the silhouette score while searching for the optimal number of clusters to create, but I get an error that says:

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

I don't understand the reason for this. Here is the code that I am using to cluster the data and calculate the silhouette score.

I read the CSV that contains the text to be clustered and run K-Means for each candidate number of clusters. What could be the reason I am getting this error?

# Create clusters using K-Means
# Only creates graph
import matplotlib
#matplotlib.use('Agg')
import re
import os
import nltk, math, codecs
import csv
from nltk.corpus import stopwords
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import silhouette_score

model_name = checkpoint_save_path
loaded_model = Doc2Vec.load(model_name)

#Load the test csv file
data = pd.read_csv(test_filename)
overview = data['overview'].astype('str').tolist()
overview = filter(bool, overview)
vectors = []

def split_words(text):
  return ''.join([x if x.isalnum() or x.isspace() else " " for x in text ]).split()

def preprocess_document(text):
  sp_words = split_words(text)
  return sp_words

for i, t in enumerate(overview):
  vectors.append(loaded_model.infer_vector(preprocess_document(t)))

sse = {}
silhouette = {}


for k in range(1,15):
  km = KMeans(n_clusters=k, max_iter=1000, verbose = 0).fit(vectors)
  sse[k] = km.inertia_
  #FOLLOWING LINE CAUSES ERROR
  silhouette[k] = silhouette_score(vectors, km.labels_, metric='euclidean')

best_cluster_size = 1
min_error = float("inf")

for cluster_size in sse:
    if sse[cluster_size] < min_error:
        min_error = sse[cluster_size]
        best_cluster_size = cluster_size

print(sse)
print("====")
print(silhouette)
Apparitor answered 17/7, 2018 at 13:10 Comment(6)
Can you add the data? - Birchard
What line in your code causes the error? - Lalia
@Birchard This is the link to the CSV data on my Google Drive: drive.google.com/open?id=1pM0RvqyQI5IIqc_UbQL6b54p_DnnxHED - Apparitor
@R.F.Nelson Sorry, I just labelled it with a comment in the question. The following line raises the error: silhouette_score(vectors, km.labels_, metric='euclidean') - Apparitor
Can you also upload the test_filename file? - Birchard
@SuhailGupta Okay, no need for your data. I found it. See my answer and let me know. - Birchard
B
39

The error occurs because your loop tries several values for the number of clusters. During the first iteration, n_clusters is 1, so all(km.labels_ == 0) is True.

In other words, you have only one cluster, with label 0 (thus, np.unique(km.labels_) prints array([0], dtype=int32)).


silhouette_score requires at least 2 distinct cluster labels, which is exactly what the error message says.


Example:

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

km = KMeans(n_clusters=3)
km.fit(X,y)  # y is accepted but ignored by KMeans

# check how many unique labels you have
np.unique(km.labels_)
#array([0, 1, 2], dtype=int32)

We have 3 different clusters/cluster labels.

silhouette_score(X, km.labels_, metric='euclidean')
0.38788915189699597

The function works fine.


Now, let's cause the error:

km2 = KMeans(n_clusters=1)
km2.fit(X,y)

silhouette_score(X, km2.labels_, metric='euclidean')
ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
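
If a loop has to include settings that can produce a single label, one option is to check the label count before calling silhouette_score. The helper below is a minimal sketch: safe_silhouette is a hypothetical name (not part of scikit-learn), and the toy data is made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def safe_silhouette(X, labels, metric='euclidean'):
    """Return the silhouette score, or None when it is undefined.

    Hypothetical helper: the score is only defined when
    2 <= n_labels <= n_samples - 1.
    """
    n_labels = len(np.unique(labels))
    n_samples = len(labels)
    if not 2 <= n_labels <= n_samples - 1:
        return None
    return silhouette_score(X, labels, metric=metric)


# Toy data (an assumption for illustration): two tight pairs of points
X_toy = np.array([[0.0], [0.1], [5.0], [5.1]])

one_label = safe_silhouette(X_toy, [0, 0, 0, 0])   # single label: undefined
two_labels = safe_silhouette(X_toy, [0, 0, 1, 1])  # valid: two labels
all_unique = safe_silhouette(X_toy, [0, 1, 2, 3])  # n_labels == n_samples: undefined
print(one_label, two_labels, all_unique)
```

Returning None instead of raising lets the caller skip the undefined cases and keep the rest of the loop unchanged.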
Birchard answered 17/7, 2018 at 14:58 Comment(0)
D
5

From the documentation,

Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1

So one way to solve this problem is to start the iteration from k = 2: use for k in range(2,15) instead of for k in range(1,15). That works for me.
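
Applied to the loop in the question, the change might look like the sketch below. The synthetic vectors are an assumption standing in for the Doc2Vec vectors, and picking the k with the highest silhouette score is one common choice, not something the question prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the Doc2Vec vectors (assumption for illustration):
# three well-separated blobs of 30 points each.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
vectors = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

sse = {}
silhouette = {}

# Start at k = 2: the silhouette score is undefined for a single cluster.
for k in range(2, 15):
    km = KMeans(n_clusters=k, max_iter=1000, n_init=10, random_state=0).fit(vectors)
    sse[k] = km.inertia_
    silhouette[k] = silhouette_score(vectors, km.labels_, metric='euclidean')

# Higher silhouette is better, so take the k with the maximum score.
best_k = max(silhouette, key=silhouette.get)
print(best_k)
print(silhouette)
```

Note that range(2, 15) covers k = 2 through 14, since the upper bound is exclusive.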

Depreciable answered 27/10, 2018 at 20:38 Comment(0)
E
1

Try changing min_samples, and also algorithm and metric.

For the list of valid metrics and algorithms, see sklearn.neighbors.VALID_METRICS.

Estrade answered 1/8, 2020 at 18:28 Comment(4)
Please consider further explaining why this could solve the problem as well as providing links to referenced external documentation. - Kingfisher
Apologies. The min_samples suggestion is for DBSCAN. I got the same error as above with DBSCAN and fixed it that way. As for the error here: with for k in range(1,15), the first iteration has k=1, so len(set(kmeans.labels_)) is 1, i.e. only 1 cluster. The silhouette coefficient measures how close the points inside a cluster are to each other compared with points from other clusters. By definition it therefore requires at least 2 clusters, meaning you should use range(2,15) rather than range(1,15). - Estrade
Please go through scikit-learn.org/stable/auto_examples/cluster/… . See the usage range_n_clusters = [2, 3, 4, 5, 6] - Estrade
I believe that silhouette_score, when sample_size is set, uses simple random sampling underneath, which can effectively lead to only one cluster label within the sample. Imagine sampling from two clusters of data - a huge one and a minor one. - Macro
F
0

Try increasing your eps value. I was getting the same error, but when I chose a higher eps value, the error went away.
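
For DBSCAN, the same error appears whenever the chosen eps leaves only one distinct label. The sketch below uses made-up toy data (an assumption, not the data from the question) to show how the number of labels changes with eps. A very small eps marks every point as noise (label -1), and a very large eps merges everything into one cluster, so both extremes give a single label.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Toy data (assumption for illustration): two tight groups, far apart
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

label_counts = {}
scores = {}

for eps in (0.1, 1.5, 20.0):
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X)
    n_labels = len(np.unique(labels))
    label_counts[eps] = n_labels
    # Only score when the silhouette coefficient is defined
    if 2 <= n_labels <= len(X) - 1:
        scores[eps] = silhouette_score(X, labels, metric='euclidean')

print(label_counts)
print(scores)
```

On this data only the middle eps value yields two labels, so raising eps helps only up to the point where the clusters start to merge.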

Forth answered 13/10, 2020 at 17:5 Comment(0)
