scikit-learn: clustering text documents using DBSCAN
I'm trying to use scikit-learn to cluster text documents. On the whole I find my way around, but I'm stuck on some specific issues. Most of the examples I found illustrate clustering with k-means as the clustering algorithm. Adapting these k-means examples to my setting works in principle. However, k-means is not suitable since I don't know the number of clusters. From what I have read so far -- please correct me here if needed -- DBSCAN or MeanShift seem to be more appropriate in my case. The scikit-learn website provides examples for each clustering algorithm. The problem is that with both DBSCAN and MeanShift I get errors I cannot comprehend, let alone solve.

My minimal code is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

docs = []
for item in [database]:
    docs.append(item)

vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)

X = X.todense() # <-- This line was needed to resolve the issue

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
...

(My documents are already preprocessed, i.e., stopwords have been removed and a Porter stemmer has been applied.)

When I run this code, I get the following error when instantiating DBSCAN and calling fit():

...
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 248, in fit
clust = dbscan(X, **self.get_params())
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 86, in dbscan
n = X.shape[0]
IndexError: tuple index out of range

Clicking on the line in dbscan_.py that throws the error, I noticed the following lines:

...
X = np.asarray(X)
n = X.shape[0]
...

When I use these two lines directly in my code for testing, I get the same error. I don't really know what np.asarray(X) is doing here, but after that call X.shape is (). Hence X.shape[0] bombs -- before, X.shape[0] correctly refers to the number of documents. Out of curiosity, I removed X = np.asarray(X) from dbscan_.py. When I do this, something is computed heavily. But after some seconds, I get another error:

...
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 214, in extractor
(min_indx,max_indx) = check_bounds(indices,N)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 198, in check_bounds
max_indx = indices.max()
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 17, in _amax
out=out, keepdims=keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity
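The empty shape described above can be reproduced in isolation. This is a minimal sketch (assuming only numpy and scipy, with a made-up toy matrix) showing that np.asarray does not densify a scipy sparse matrix but wraps it in a 0-dimensional object array, which loses the shape information:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small sparse "document-term" matrix: 3 samples, 3 features.
X = csr_matrix(np.eye(3))
print(X.shape)  # (3, 3)

# np.asarray does not convert a sparse matrix to a dense array;
# it wraps the whole object in a 0-d object array, so shape becomes ().
A = np.asarray(X)
print(A.shape)  # ()
```

This is why X.shape[0] fails inside dbscan_.py: the correct conversion would have been X.toarray() or X.todense().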

In short, I have no clue how to get DBSCAN working, or what I might have missed, in general.

Jez answered 9/8, 2014 at 9:22 Comment(0)

The implementation in sklearn seems to assume you are dealing with a finite vector space, and wants to find the dimensionality of your data set. Text data is commonly represented as sparse vectors, but not with the same dimensionality.

Your input data probably isn't a data matrix, but the sklearn implementation needs it to be one.

You'll need to find a different implementation. Maybe try the implementation in ELKI, which is very fast, and should not have this limitation.

You'll need to spend some time understanding similarity first. For DBSCAN, you must choose epsilon in a way that makes sense for your data. There is no rule of thumb; this is domain specific. Therefore, you first need to figure out which similarity threshold means that two documents are similar.

Mean Shift may actually need your data to be a vector space of fixed dimensionality.
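To illustrate the similarity-threshold point: with TF-IDF vectors, cosine similarity is a natural measure, and under cosine distance DBSCAN's epsilon corresponds to 1 minus the similarity threshold you settle on. A minimal sketch (the example documents are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "stock markets fell sharply today",
]
X = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarities between documents; inspect these to
# decide which similarity value means "these two are similar".
# With cosine distance, DBSCAN's eps = 1 - similarity_threshold.
sim = cosine_similarity(X)
print(sim.round(2))
```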

Tritium answered 9/8, 2014 at 10:1 Comment(5)
Quoting Homer: "Uh huh. Uh huh. Okay. Um, can you repeat the part of the stuff where you said all about the...things? Uh... the things?" :). I just started to play around, trying to follow and understand the examples -- to get things working, not worrying about the results for the moment. I just can't see the difference between my setting and the examples. X.shape tells me it's a (832, 20932) matrix, which reflects my 832 documents and 20k+ different terms. But you're right, of course, I need to get a better understanding. I will have a look at ELKI. Thanks a lot!Jez
Short story: it's not a DBSCAN limitation, but it could be a scipy limitation. If np.asarray(X).shape returns a tuple, then it should not fail as above. I don't use numpy enough to be able to tell you how to properly convert a sparse matrix into a dense matrix.Tritium
I found the problem: The expected format of matrix X differs between, e.g., k-means and DBSCAN. While both expect an (n_samples, n_features) matrix, k-means accepts a sparse matrix, whereas DBSCAN requires a dense matrix. Thus, if I add X = X.todense() before calling fit(X), it works.Jez
That is essentially what I'm trying to say. Except that technically DBSCAN does not need a dense matrix. It's the sklearn version that does, for a reason unknown to me.Tritium
Yeah, I had to get used to the whole numpy matrix notions. The sklearn documentation is not intuitive without the required insights into numpy; hence my problems. Thanks a lot for your help, I will mark your answer as correct.Jez

It looks like sparse representations for DBSCAN are supported as of Jan. 2015.

I upgraded sklearn to 0.16.1 and it worked for me on text.
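Assuming scikit-learn 0.16.1 or later, DBSCAN can then be fit on the sparse TF-IDF matrix directly, and cosine distance suits TF-IDF vectors better than the Euclidean default. A sketch with made-up documents and an eps chosen for the toy data only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stocks fell sharply today",
    "markets fell sharply today",
    "completely unrelated text here",
]
X = TfidfVectorizer().fit_transform(docs)  # sparse CSR matrix, no todense()

# eps = 1 - similarity threshold; min_samples kept tiny for the toy data.
db = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit(X)
print(db.labels_)  # noise points are labelled -1
```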

Hundredfold answered 27/10, 2015 at 19:2 Comment(0)
