Why are all labels_ -1 in the output generated by DBSCAN in Python?

    from sklearn.cluster import DBSCAN

    dbscan = DBSCAN(eps=0.001, min_samples=10)
    clustering = dbscan.fit(X)

Example vectors:

    array([[ 0.05811029, -1.089355  , -1.9143777 , ...,  1.235167  ,
            -0.6473859 ,  1.5684978 ],
           [-0.7117326 , -0.31876346, -0.45949244, ...,  0.17786546,
             1.9377285 ,  2.190525  ],
           [ 1.1685177 , -0.18201494,  0.19475089, ...,  0.7026453 ,
             0.3937522 , -0.78675956],
           ...,
           [ 1.4172379 ,  0.01070347, -1.3984257 , ..., -0.70529956,
             0.19471683, -0.6201791 ],
           [ 0.6171041 , -0.8058429 ,  0.44837445, ...,  1.216958  ,
            -0.10003573, -0.19012968],
           [ 0.6433722 ,  1.1571665 , -1.2123466 , ...,  0.592805  ,
             0.23889546,  1.6207514 ]], dtype=float32)

X is model.wv.vectors, generated from model = word2vec.Word2Vec(sent, min_count=1, size=50, workers=3, window=3, sg=1)

Results are as follows:

array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])

Moist answered 16/1, 2020 at 18:21 Comment(14)
It's going to be hard for people to answer your question if they cannot replicate the code. Can you take the code from the images and format it here? Also, if you have some sample data you could provide, that would help us in problem-solving with you. – Retaliate
@lwileczek I just don't know how to write the code here. – Moist
Can you show the code that actually outputs the array of -1 values? Also, per the DBSCAN docs, it's designed to return -1 for 'noisy' samples that aren't in any 'high-density' cluster. It's possible that your word-vectors are so evenly distributed there are no 'high-density' clusters. (From what data are you training the word-vectors, & how large is the set of word-vectors? Have you verified the word-vectors appear sensible/useful by other checks?) – Malone
You might need to tune the DBSCAN parameters for your data. And, it might make sense to operate on the unit-length-normed word-vectors, instead of the raw magnitude vectors. (Execute model.wv.init_sims(), then use model.wv.vectors_norm instead of model.wv.vectors.) Finally, min_count=1 usually results in worse word-vectors than a higher min_count value that discards words with so few usage examples. Rare words can't get strong vectors, & keeping them in training also interferes with improvement of other more-frequent words' vectors. – Malone
@Malone I show the array with clustering.labels_. And I will try your suggestion later ~ thanks. – Moist
@Malone Sorry, I still don't know how to format code in the comment box... – Moist
The indentation you've used in the topmost 3 lines of code you've shown is one perfectly fine way of formatting a code excerpt. There's a lot more info on ways to present your typed, or copied-and-pasted, text of code or output at: stackoverflow.com/editing-help – Malone
@Malone I tried your way with model.wv.vectors_norm and model.wv.vectors. I cannot set min_count higher, since in my dataset there are DishNames that only show up once. – Moist
@Malone And with more than 30k words, the result is bad too: only -1. – Moist
@Malone Still bad even when I set min_count=5... almost crying... – Moist
Words that only have 1 example in your training data are unlikely to get good word-vectors. Their final positions will be some mix of their random starting position, & the influence of the possibly-arbitrarily idiosyncratic single usage example – offset by the influence of all the other more-frequent words on the neural-network's weights. So any patterns of their neighborhoods for clustering may be weak – they are nearly 'noise', so it wouldn't be surprising if they contribute to leaving DBSCAN with lots of 'noisy' results. – Malone
30k total words would be a tiny, tiny dataset for Word2Vec purposes. Is that the size of the corpus, or the number of unique words? With a small corpus, or small number of unique words, but still multiple varied examples of each word, you might be able to get useful Word2Vec results with smaller size dimensions & more epochs training-passes, but it's not certain. Have you been able to check the vectors for usefulness separate from the clustering, by spot-reviewing if vectors' most_similar() neighbors make sense according to your domain understanding? – Malone
Your best chance of getting some contrastingly-meaningful vectors could be to do all of: (1) higher min_count (while observing to see exactly how far this further shrinks the effective corpus); (2) more epochs; (3) fewer size dimensions. (Possibly also: larger window or negative.) Then also, using vectors_norm (to move all vectors to points on the 'unit sphere' for more contrast given the DBSCAN euclidean-neighborhoods). Then also, tinkering with the DBSCAN parameters to make it more sensitive. (A rough sketch of these steps follows this comment thread.) – Malone
But still, you might not have enough data for Word2Vec to work well, and DBSCAN clustering might not be good for even stronger Word2Vec vectors, unless you have some external reason to believe these are the right algorithms for your data/problem-domain. Why do you want to create a fixed number of clusters from these word-vectors? – Malone
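
A rough sketch pulling together the tuning suggested in the comments above. It assumes the gensim 3.x API used in the question (size, iter, init_sims), that sent is the same training corpus as in the question, and that every parameter value shown is only a starting point to experiment with, not a recommendation:

    from gensim.models import word2vec
    from sklearn.cluster import DBSCAN
    import numpy as np

    model = word2vec.Word2Vec(
        sent,            # same corpus as in the question
        min_count=5,     # discard very rare words, which only get weak vectors
        size=20,         # fewer dimensions for a small corpus
        window=5,        # wider context window
        iter=20,         # more training epochs (gensim 3.x parameter name)
        sg=1,
        workers=3,
    )

    model.wv.init_sims()              # precompute unit-length-normed vectors
    X_norm = model.wv.vectors_norm    # every vector now lies on the unit sphere

    # On the unit sphere, euclidean distances fall in [0, 2], so eps must be
    # tuned in that range; 0.4 and 5 here are placeholders, not tested values.
    labels = DBSCAN(eps=0.4, min_samples=5).fit(X_norm).labels_
    print(np.unique(labels, return_counts=True))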

Based on the docs:

    labels_ : array, shape = [n_samples]

Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.

You can find the answer to this here: What are noisy samples in Scikit's DBSCAN clustering algorithm?

In short: these points are not part of any cluster. They are simply points that do not belong to any cluster and can be "ignored" to some extent. It seems that your data points are all quite different from each other, so no dense, central clusters form.
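
For example, you can count how many points ended up as noise versus in clusters (a sketch; clustering is the estimator fitted in the question):

    import numpy as np

    labels = clustering.labels_              # from dbscan.fit(X) in the question
    print(dict(zip(*np.unique(labels, return_counts=True))))

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    print("clusters:", n_clusters, "noise points:", n_noise)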

What can you try?

    DBSCAN(eps=0.5, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)

You can play with the parameters or change the clustering algorithm. Did you try k-means?
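
For example (a sketch; X is the vector array from the question, and the parameter values are only guesses to tune):

    from sklearn.cluster import DBSCAN, KMeans

    # Loosen DBSCAN until neighbourhoods are large enough for clusters to form.
    for eps in (0.5, 1.0, 2.0, 5.0):
        labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print("eps=%.1f: %d clusters, %d noise points"
              % (eps, n_clusters, (labels == -1).sum()))

    # k-means assigns every point to one of n_clusters groups (it has no noise
    # label), which is why it always "works" even on evenly spread data.
    kmeans_labels = KMeans(n_clusters=10, random_state=0).fit_predict(X)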

Zoraidazorana answered 17/1, 2020 at 8:37 Comment(1)
I tried yours and it's better, but not good enough: the results are only -1 and 0. I had tried k-means, and it worked well. I'm so curious about why such a difference exists. – Moist

Your eps value is 0.001; try increasing that so that you get clusters forming (or else every point will be considered an outlier / labelled -1 because it's not in a cluster).
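
A common heuristic for choosing a larger eps (not part of the original answer, just a standard approach) is to plot each point's distance to its k-th nearest neighbour, sorted, and pick eps near the 'elbow' of that curve; a sketch, assuming X is the vector array from the question:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import NearestNeighbors

    k = 10                                   # match the min_samples you pass to DBSCAN
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors(X)          # distances[:, -1] is the k-th NN distance

    plt.plot(np.sort(distances[:, -1]))
    plt.xlabel("points sorted by k-th neighbour distance")
    plt.ylabel("distance to %dth nearest neighbour" % k)
    plt.show()                               # choose eps around the bend ('elbow')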

Kinard answered 19/8, 2020 at 8:2 Comment(0)
