How to perform clustering on Word2Vec
I have a semi-structured dataset, each row pertains to a single user:

id, skills
0,"java, python, sql"
1,"java, python, spark, html"
2, "business management, communication"

It is semi-structured because the skills can only be selected from a list of 580 unique values.

My goal is to cluster users, or find similar users, based on similar skill sets. I have tried using a Word2Vec model, which gives me very good results for identifying similar skill sets. For example,

model.most_similar(["Data Science"])

gives me -

[('Data Mining', 0.9249375462532043),
 ('Data Visualization', 0.9111810922622681),
 ('Big Data', 0.8253220319747925),...

This gives me a very good model for identifying individual skills, but not groups of skills. How do I make use of the vectors provided by the Word2Vec model to successfully cluster groups of similar users?

Wendiwendie answered 28/8, 2018 at 3:7 Comment(1)

You need to vectorize your strings using your Word2Vec model. You can do it like this:

from gensim.models import KeyedVectors
import numpy as np

model = KeyedVectors.load("path/to/your/model")
w2v_vectors = model.wv.vectors  # matrix with one row (vector) per word in the model
w2v_indices = {word: model.wv.vocab[word].index for word in model.wv.vocab}  # maps each word to its row index in w2v_vectors (gensim < 4.0; in gensim 4.x use model.wv.key_to_index)

Then you can use it in this way:

def vectorize(line):
    words = []
    for word in line:  # line - iterable, for example a list of tokens
        try:
            w2v_idx = w2v_indices[word]
        except KeyError:  # skip words that have no vector in the w2v model
            continue
        words.append(w2v_vectors[w2v_idx])
    if not words:
        return None
    words = np.asarray(words)
    min_vec = words.min(axis=0)  # element-wise minimum over all word vectors
    max_vec = words.max(axis=0)  # element-wise maximum over all word vectors
    return np.concatenate((min_vec, max_vec))

This returns a single vector that represents your line (document, etc.).
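For instance, you could apply this to each user's skill list to build the matrix for clustering. This is a minimal, self-contained sketch: the tiny `w2v_indices` / `w2v_vectors` stand-ins below are made-up illustration data, not output of a real trained model.

```python
import numpy as np

# Toy stand-ins for the real model's data (illustration only; in practice
# these come from your trained Word2Vec model as shown above).
w2v_indices = {"java": 0, "python": 1, "sql": 2}
w2v_vectors = np.array([[0.1, 0.9],
                        [0.2, 0.8],
                        [0.7, 0.3]])

def vectorize(line):
    # Collect vectors for known words only; unknown words are skipped.
    words = [w2v_vectors[w2v_indices[w]] for w in line if w in w2v_indices]
    if not words:
        return None
    words = np.asarray(words)
    # Concatenate element-wise min and max, doubling the dimension.
    return np.concatenate((words.min(axis=0), words.max(axis=0)))

users = [["java", "python", "sql"],
         ["java", "python"],
         ["cobol"]]  # no known skills -> None
vecs = [vectorize(u) for u in users]
X = np.vstack([v for v in vecs if v is not None])  # drop users with no vector
print(X.shape)  # (2, 4): two vectorizable users, 2-dim embedding doubled by min+max
```

Keep track of which users you dropped (those that returned `None`), so the rows of `X` can be mapped back to user ids after clustering.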

Once you have a vector for each line, you need to cluster them; you can use DBSCAN from sklearn for clustering.

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(metric='cosine', eps=0.07, min_samples=3)  # example parameters; tune them for your data
cluster_labels = dbscan.fit_predict(X)  # X is your matrix, one row per document (line) to cluster
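To see which users ended up together, you can group ids by label; DBSCAN marks points it considers noise with the label `-1`. A minimal end-to-end sketch, using made-up 2-dimensional document vectors (two tight groups plus one outlier) in place of the real min/max vectors:

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import DBSCAN

# Toy document vectors (illustration only): two tight groups plus one outlier.
X = np.array([[1.0, 0.0], [0.99, 0.01], [0.98, 0.02],
              [0.0, 1.0], [0.01, 0.99], [0.02, 0.98],
              [0.7, 0.7]])

dbscan = DBSCAN(metric='cosine', eps=0.05, min_samples=2)
cluster_labels = dbscan.fit_predict(X)

clusters = defaultdict(list)
for user_id, label in enumerate(cluster_labels):
    clusters[label].append(user_id)  # label -1 = noise (user in no cluster)

print(dict(clusters))  # {0: [0, 1, 2], 1: [3, 4, 5], -1: [6]}
```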

Good luck!

Cablegram answered 28/8, 2018 at 14:37 Comment(5)
I really like this approach, and it works well for clustering documents (here: user skill tags) with a limited set of vocabulary words. Just a word of caution for future readers: It didn't work at all for me with documents which are composed of whole sentences :-(Weariful
Method question: is vectorizing the strings with Word2Vec even needed? Could you just one-hot encode the skills and cluster based on that?Asgard
@Cablegram What is the reason for getting the min_vec and max_vec and concatenating them?Papua
@JeffParker TFIDF would work too, but I believe vectorization lets you use similar words rather than exact words.Arbitrator
I think there is a bug in vectorize(..) here: shouldn't we check whether words is empty only after we have finished iterating over all the words in line?Tarsia
