Why does DBSCAN clustering return a single cluster on the MovieLens dataset?
The Scenario:

I'm performing clustering on the MovieLens dataset, which I have in two formats:

OLD FORMAT:

uid iid rat
941 1   5
941 7   4
941 15  4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4

NEW FORMAT:

uid 1               2               3               4
1   5               3               4               3
2   4               3.6185548023    3.646073985     3.9238342172
3   2.8978348799    2.6692556753    2.7693015618    2.8973463681
4   4.3320762062    4.3407749532    4.3111995162    4.3411425423
940 3.7996234581    3.4979386925    3.5707888503    2
941 5               NaN             NaN             NaN
942 4.5762594612    4.2752554573    4.2522440019    4.3761477591
943 3.8252406362    5               3.3748860659    3.8487417604

I need to cluster this data using KMeans, DBSCAN and HDBSCAN. With KMeans I can set the number of clusters and obtain them.

The Problem

The problem occurs only with DBSCAN & HDBSCAN: I'm unable to get a reasonable number of clusters (I do know the number of clusters cannot be set manually for these algorithms).

Techniques Tried:

  • Tried it with the Iris dataset, dropping the Species column (it is a string and is the value to be predicted anyway); everything works fine with that dataset (Snippet 1).
  • Tried the MovieLens 100K dataset in OLD FORMAT, with and without the UID column, on the analogy that UID plays the same role as Species (Snippet 2).
  • Tried the same with the NEW FORMAT (with and without UID), yet the results came out the same way.

Snippet 1:

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

print("\n\n FOR IRIS DATA-SET:")

iris = load_iris()
dbscan = DBSCAN()

d = pd.DataFrame(iris.data)
dbscan.fit(d)
print("Clusters", set(dbscan.labels_))

Snippet 1 (Output):

FOR IRIS DATA-SET:
Clusters {0, 1, -1}
Out[30]: 
array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1,  1,  1,
        1,  1,  1, -1, -1,  1, -1, -1,  1,  1,  1,  1,  1,  1,  1, -1, -1,
        1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1, -1, -1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

Snippet 2:

import pandas as pd
from sklearn.cluster import DBSCAN

ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
if ch == 1:
    data_set = pd.read_csv("MainMatrix_IBCF.csv")
elif ch == 2:
    data_set = pd.read_csv("MainMatrix_UBCF.csv")
else:
    raise SystemExit("Enter a proper choice!")

# Drop the uid column and any rows with missing values.
data_set = data_set.iloc[:, 1:].dropna()

print("Starting with DBSCAN for Clustering on")
data_set.info()

db_cluster = DBSCAN()
db_cluster.fit(data_set)
print("Clusters assigned are:", set(db_cluster.labels_))

Snippet 2 (Output):

Extended Cluster Methods for:
1. Main Matrix IBCF 
2. Main Matrix UBCF
Ch:1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
Clusters assigned are: {-1}

As seen, it returns only one label, -1, meaning every point was marked as noise. I'd like to hear what I am doing wrong.

Ratafia answered 1/1, 2018 at 17:35 Comment(0)

As pointed out by @faraway and @Anony-Mousse, the solution is more about the mathematics of the dataset than the programming.

I could finally figure out the clusters. Here's how:

import numpy as np
from sklearn.cluster import DBSCAN

db_cluster = DBSCAN(eps=9.7, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2)
arr = db_cluster.fit_predict(data_set)
print("Clusters assigned are:", set(db_cluster.labels_))

# Count how many points landed in each cluster.
uni, counts = np.unique(arr, return_counts=True)
print(dict(zip(uni, counts)))

The concepts of epsilon and outliers became much clearer thanks to this SO question: How can I choose eps and minPts (two parameters for DBSCAN algorithm) for efficient results?
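A common heuristic for choosing eps is the k-distance plot: sort every point's distance to its k-th nearest neighbor and look for the elbow. A minimal sketch on synthetic data (the synthetic blobs and the 95th-percentile shortcut are stand-ins; on real data you would inspect the plot by eye):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the real ratings matrix (which is ~943 x 1682).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# k-distance data: distance of each point to its k-th nearest neighbor.
k = 4  # a common default for min_samples
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])

# Normally you plot k_dist and pick eps near the "elbow"; a high quantile
# is used here only so the sketch stays non-interactive.
eps = float(np.quantile(k_dist, 0.95))

labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("eps:", round(eps, 3), "clusters:", n_clusters)
```

With eps far below the elbow, nearly everything becomes noise (the single -1 "cluster" in the question); far above it, everything merges into one cluster.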

Ratafia answered 17/1, 2018 at 7:40 Comment(7)
With min_samples=2, you are not really doing DBSCAN, but single-linkage clustering. For real DBSCAN, choose larger minimum sizes (otherwise, everything is dense). – Gentes
I tried increasing it, however it returns more outliers. Any solution to that? @Erich – Ratafia
So, is there a standard not to set min_samples to 2? Is there any equation that can retain min_samples w.r.t. the dataset? – Ratafia
Well, as said before, with min_samples<=2 you are getting a single-linkage clustering, which long predates DBSCAN. If you want density-based clustering, you need to use enough samples to get density. Define "retain" for the second part. – Gentes
Retain meaning: determining the number based on the dataset one is handling. – Ratafia
"Density = points / radius" has a fairly stable meaning in many applications, if enough points are considered. This depends on how well your distance function retains its meaning. – Gentes

You need to choose appropriate parameters. With a too-small epsilon, everything becomes noise. sklearn shouldn't have a default value for this parameter; it needs to be chosen differently for each dataset.

You also need to preprocess your data.
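For Euclidean DBSCAN, one common preprocessing step is standardizing the columns, since a single large-scale feature can swamp the distance. A sketch on synthetic data (not the asker's CSVs):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two groups separated in the first feature; the second feature is pure
# large-scale noise that dominates Euclidean distances if left unscaled.
X = np.vstack([
    np.column_stack([rng.normal(0, 0.1, 100), rng.normal(0, 1000, 100)]),
    np.column_stack([rng.normal(5, 0.1, 100), rng.normal(0, 1000, 100)]),
])

raw = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
scaled = DBSCAN(eps=0.5, min_samples=5).fit_predict(StandardScaler().fit_transform(X))

print("raw labels:", set(raw))  # everything is noise at this scale
print("scaled clusters:", len(set(scaled) - {-1}))
```

The same eps that marks every unscaled point as noise recovers the group structure once the columns are standardized.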

It's trivial to get "clusters" with kmeans that are meaningless...

Don't just call random functions. You need to understand what you are doing, or you are just wasting your time.

Seko answered 1/1, 2018 at 23:56 Comment(2)
Great advice, but I can't really put out my actual objective and code here. Just understand that I need these two clustering methods done. If you can point out the preprocessing required and the parameters to use, that would be of real use to me. – Ratafia
Read the DBSCAN paper. The parameters are documented there. Preprocessing is similar to what is needed to make kmeans return meaningful results if you use Euclidean distance (but in contrast to kmeans, you can use other distances that are more relevant for your mystery objective). – Seko

Firstly, you need to preprocess your data, removing any useless attributes such as ids, and any incomplete instances (in case your chosen distance measure can't handle them).

It's good to understand that these algorithms come from two different paradigms: centroid-based (KMeans) and density-based (DBSCAN & HDBSCAN*). While centroid-based algorithms usually take the number of clusters as an input parameter, density-based algorithms need the number of neighbors (minPts) and the radius of the neighborhood (eps).

Normally in the literature, the number of neighbors (minPts) is set to 4 and the radius (eps) is found by analyzing different values. You may find HDBSCAN* easier to use, as you only need to inform the number of neighbors (minPts).

If, after trying different configurations, you are still getting useless clusterings, maybe your data has no clusters at all and the KMeans output is meaningless.

Melodist answered 2/1, 2018 at 19:34 Comment(0)

Have you tried seeing how the clusters look in 2D space, e.g. using PCA? If the whole data is dense and actually forms a single group, then you might get a single cluster.

Change other parameters, like min_samples=5, algorithm, and metric. Possible values of algorithm and metric can be checked from sklearn.neighbors.VALID_METRICS.
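A minimal sketch of that PCA check (synthetic data stands in for the 943 x 1682 ratings matrix):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Stand-in for the wide ratings matrix.
X, _ = make_blobs(n_samples=200, n_features=50, centers=2, random_state=1)

X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)  # (200, 2)
# e.g. plt.scatter(X2[:, 0], X2[:, 1]) then shows whether the points form
# one dense mass (likely a single DBSCAN cluster) or separate groups.
```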

Cantatrice answered 1/8, 2020 at 16:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.