from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.001, min_samples=10)
clustering = dbscan.fit(X)
Example vectors:
array([[ 0.05811029, -1.089355 , -1.9143777 , ..., 1.235167 ,
-0.6473859 , 1.5684978 ],
[-0.7117326 , -0.31876346, -0.45949244, ..., 0.17786546,
1.9377285 , 2.190525 ],
[ 1.1685177 , -0.18201494, 0.19475089, ..., 0.7026453 ,
0.3937522 , -0.78675956],
...,
[ 1.4172379 , 0.01070347, -1.3984257 , ..., -0.70529956,
0.19471683, -0.6201791 ],
[ 0.6171041 , -0.8058429 , 0.44837445, ..., 1.216958 ,
-0.10003573, -0.19012968],
[ 0.6433722 , 1.1571665 , -1.2123466 , ..., 0.592805 ,
0.23889546, 1.6207514 ]], dtype=float32)
`X` is `model.wv.vectors`, generated from `model = word2vec.Word2Vec(sent, min_count=1, size=50, workers=3, window=3, sg=1)`.
Results are as follows:
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
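For reference, a minimal self-contained version of the setup above might look like the sketch below. It assumes gensim 3.x (where the dimensionality argument is `size`; in gensim 4+ it is `vector_size`), and the `sent` shown here is only a placeholder for whatever tokenized corpus was actually used:

```python
from gensim.models import word2vec
from sklearn.cluster import DBSCAN

# Placeholder corpus: a list of tokenized sentences (the real `sent` is not shown in the question).
sent = [["this", "is", "a", "sentence"], ["another", "tokenized", "sentence"]]

# Same parameters as in the question (gensim 3.x API: `size`, not `vector_size`).
model = word2vec.Word2Vec(sent, min_count=1, size=50, workers=3, window=3, sg=1)
X = model.wv.vectors                      # one 50-dimensional row per vocabulary word

dbscan = DBSCAN(eps=0.001, min_samples=10)
labels = dbscan.fit_predict(X)            # equivalent to dbscan.fit(X).labels_; -1 marks noise points
```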
Comments:

…`-1` values? Also, per the `DBSCAN` docs, it's designed to return `-1` for 'noisy' samples that aren't in any 'high-density' cluster. It's possible that your word-vectors are so evenly distributed there are no 'high-density' clusters. (From what data are you training the word-vectors, & how large is the set of word-vectors? Have you verified the word-vectors appear sensible/useful by other checks?) – Malone
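A quick way to check how that plays out on the output above is to count how many points land in real clusters versus the `-1` noise bucket (a small sketch; `labels` is assumed to be the label array from the fit, i.e. `clustering.labels_`):

```python
import numpy as np

# `labels` is the DBSCAN output, e.g. clustering.labels_ from the question's code.
unique, counts = np.unique(labels, return_counts=True)
print(dict(zip(unique.tolist(), counts.tolist())))   # if -1 is the only key, every point was treated as noise

n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```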
…(to get unit-length vectors, call `model.wv.init_sims()`, then use `model.wv.vectors_norm` instead of `model.wv.vectors`.) Finally, `min_count=1` usually results in worse word-vectors than a higher `min_count` value that discards words with so few usage examples. Rare words can't get strong vectors, & keeping them in training also interferes with improvement of other, more-frequent words' vectors. – Malone
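A sketch of the normalization idea from that comment, using the gensim 3.x API it refers to (`init_sims()` / `vectors_norm`; in gensim 4+ the equivalent is `model.wv.get_normed_vectors()`). The `eps` value here is only a starting guess, not something from the thread:

```python
from sklearn.cluster import DBSCAN

model.wv.init_sims()                   # precompute unit-length (L2-normalized) copies of the vectors
X_norm = model.wv.vectors_norm         # same shape as model.wv.vectors, but every row lies on the unit sphere

# On the unit sphere, pairwise euclidean distances fall in [0, 2], so eps needs retuning;
# 0.5 is an arbitrary starting point to experiment with.
labels_norm = DBSCAN(eps=0.5, min_samples=10).fit_predict(X_norm)
```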
…`DBSCAN` with lots of 'noisy' results. – Malone

…that seems small for `Word2Vec` purposes. Is that the size of the corpus, or the number of unique words? With a small corpus, or a small number of unique words, but still multiple varied examples of each word, you might be able to get useful `Word2Vec` results with smaller `size` dimensions & more `epochs` training-passes, but it's not certain. Have you been able to check the vectors for usefulness, separate from the clustering, by spot-reviewing whether vectors' `most_similar()` neighbors make sense according to your domain understanding? – Malone
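The kind of spot-check that comment describes can be as simple as the following (the probe words are hypothetical; substitute frequent terms from your own corpus):

```python
# Sanity-check vector quality independently of the clustering step.
for probe in ["price", "customer", "delivery"]:        # placeholder words, not from the question
    if probe in model.wv:
        print(probe, "->", model.wv.most_similar(probe, topn=5))
    else:
        print(probe, "is not in the vocabulary")
```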
…things worth trying: (1) a higher `min_count` (while observing to see exactly how far this further shrinks the effective corpus); (2) more `epochs`; (3) fewer `size` dimensions. (Possibly also: larger `window` or `negative`.) Then also, using `vectors_norm` (to move all vectors to points on the 'unit sphere' for more contrast, given the `DBSCAN` euclidean-neighborhoods). Then also, tinkering with the `DBSCAN` parameters to make it more sensitive. – Malone
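Putting those suggestions together might look roughly like the sketch below; every concrete value (`min_count=5`, `size=25`, `iter=20`, `window=5`, `k`, `eps`, `min_samples`) is an illustrative guess to be tuned, not a recommendation from the thread. (`iter` is the epochs parameter in gensim 3.x; in gensim 4+ it is `epochs`.)

```python
import numpy as np
from gensim.models import word2vec
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# `sent` is the tokenized corpus from earlier; it needs to be large enough to
# survive the higher min_count for this to produce a useful vocabulary.
model = word2vec.Word2Vec(sent, min_count=5, size=25, window=5, sg=1, iter=20, workers=3)
model.wv.init_sims()
X_norm = model.wv.vectors_norm

# Common heuristic for choosing eps: look at each point's distance to its k-th
# nearest neighbour and pick a value near the 'elbow' of the sorted curve.
k = 10
distances, _ = NearestNeighbors(n_neighbors=k).fit(X_norm).kneighbors(X_norm)
print(np.sort(distances[:, -1]))        # inspect these values to choose eps

labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X_norm)
```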
…may be too little data for `Word2Vec` to work well, and `DBSCAN` clustering might not be good even for stronger `Word2Vec` vectors, unless you have some external reason to believe these are the right algorithms for your data/problem-domain. Why do you want to create a fixed number of clusters from these word-vectors? – Malone
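And if the underlying goal really is a fixed number of clusters, as the last comment asks, an algorithm that takes the cluster count directly (e.g. `KMeans`) may be a more natural fit than `DBSCAN`. A minimal sketch, with `n_clusters=10` as a pure placeholder:

```python
from sklearn.cluster import KMeans

# KMeans always returns exactly n_clusters groups; 10 is a placeholder value.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels_km = kmeans.fit_predict(X_norm)    # X_norm: the unit-normalized word-vectors from above
```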