How to perform clustering on Word2Vec
I have a semi-structured dataset, each row pertains to a single user:

id, skills
0,"java, python, sql"
1,"java, python, spark, html"
2, "business management, communication"

It is semi-structured because the skills can only be selected from a list of 580 unique values.

My goal is to cluster users, or find similar users, based on similar skill sets. I have tried using a Word2Vec model, which gives me very good results for identifying similar skill sets. For example,

model.most_similar(["Data Science"])

gives me -

[('Data Mining', 0.9249375462532043),
 ('Data Visualization', 0.9111810922622681),
 ('Big Data', 0.8253220319747925),...

This gives me a very good model for identifying individual skills, but not groups of skills. How do I make use of the vectors provided by the Word2Vec model to successfully cluster groups of similar users?

Wendiwendie answered 28/8, 2018 at 3:7 Comment(1)

You need to vectorize your strings using your Word2Vec model. You can do it like this:

from gensim.models import KeyedVectors
import numpy as np

model = KeyedVectors.load("path/to/your/model")
w2v_vectors = model.wv.vectors  # matrix with one row (vector) per word in the model
w2v_indices = {word: model.wv.vocab[word].index for word in model.wv.vocab}  # maps each word to its row index in w2v_vectors (gensim < 4.0; in gensim 4.x use model.wv.key_to_index)

Then you can use it in this way:

def vectorize(line):
    words = []
    for word in line:  # line - iterable, for example a list of tokens
        try:
            w2v_idx = w2v_indices[word]
        except KeyError:  # skip words that have no vector in the w2v model
            continue
        words.append(w2v_vectors[w2v_idx])
    if not words:
        return None
    words = np.asarray(words)
    min_vec = words.min(axis=0)  # element-wise minimum over all word vectors
    max_vec = words.max(axis=0)  # element-wise maximum over all word vectors
    return np.concatenate((min_vec, max_vec))

This returns a single vector that represents your line (document, etc.).
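For instance, you could apply this to each user's skill list to build the matrix for clustering. This is a minimal, self-contained sketch: the tiny `w2v_indices` / `w2v_vectors` stand-ins below are made-up illustration data, not output of a real trained model.

```python
import numpy as np

# Toy stand-ins for the real model's data (illustration only; in practice
# these come from your trained Word2Vec model as shown above).
w2v_indices = {"java": 0, "python": 1, "sql": 2}
w2v_vectors = np.array([[0.1, 0.9],
                        [0.2, 0.8],
                        [0.7, 0.3]])

def vectorize(line):
    # Collect vectors for known words only; unknown words are skipped.
    words = [w2v_vectors[w2v_indices[w]] for w in line if w in w2v_indices]
    if not words:
        return None
    words = np.asarray(words)
    # Concatenate element-wise min and max, doubling the dimension.
    return np.concatenate((words.min(axis=0), words.max(axis=0)))

users = [["java", "python", "sql"],
         ["java", "python"],
         ["cobol"]]  # no known skills -> None
vecs = [vectorize(u) for u in users]
X = np.vstack([v for v in vecs if v is not None])  # drop users with no vector
print(X.shape)  # (2, 4): two vectorizable users, 2-dim embedding doubled by min+max
```

Keep track of which users you dropped (those that returned `None`), so the rows of `X` can be mapped back to user ids after clustering.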

Once you have a vector for each line, you need to cluster them; you can use DBSCAN from sklearn for clustering.

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(metric='cosine', eps=0.07, min_samples=3)  # example parameters; tune them for your data
cluster_labels = dbscan.fit_predict(X)  # X is your matrix, one row per document (line) to cluster
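To see which users ended up together, you can group ids by label; DBSCAN marks points it considers noise with the label `-1`. A minimal end-to-end sketch, using made-up 2-dimensional document vectors (two tight groups plus one outlier) in place of the real min/max vectors:

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import DBSCAN

# Toy document vectors (illustration only): two tight groups plus one outlier.
X = np.array([[1.0, 0.0], [0.99, 0.01], [0.98, 0.02],
              [0.0, 1.0], [0.01, 0.99], [0.02, 0.98],
              [0.7, 0.7]])

dbscan = DBSCAN(metric='cosine', eps=0.05, min_samples=2)
cluster_labels = dbscan.fit_predict(X)

clusters = defaultdict(list)
for user_id, label in enumerate(cluster_labels):
    clusters[label].append(user_id)  # label -1 = noise (user in no cluster)

print(dict(clusters))  # {0: [0, 1, 2], 1: [3, 4, 5], -1: [6]}
```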

Good luck!

Cablegram answered 28/8, 2018 at 14:37 Comment(5)
I really like this approach, and it works well for clustering documents (here: user skill tags) with a limited set of vocabulary words. Just a word of caution for future readers: It didn't work at all for me with documents which are composed of whole sentences :-(Weariful
Method question: is vectorizing the strings with Word2Vec even needed? Could you just one-hot encode the skills and cluster based on that?Asgard
@Cablegram What is the reason for getting the min_vec and max_vec and concatenating them?Papua
@JeffParker TFIDF would work too, but I believe vectorization lets you use similar words rather than exact words.Arbitrator
I think there is a bug in vectorize(..) here: shouldn't we check whether words is empty only after we have finished iterating over all the words in line?Tarsia
