doc2vec: How to cluster DocvecsArray

I've patched together the following code from examples I've found around the web:

# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans

# random
from random import shuffle

class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources

        flipped = {}

        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
        return self.sentences

    def sentences_perm(self):
        shuffle(self.sentences)
        return self.sentences

sources = {'test.txt' : 'DOCS'}
sentences = LabeledLineSentence(sources)

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(sentences.to_array())

for epoch in range(10):
    model.train(sentences.sentences_perm())

print(model.docvecs)

My test.txt file contains one paragraph per line.

The code runs fine and generates a DocvecsArray with one vector per line of text.

My goal is to get output like this:

cluster 1: [DOC_5,DOC_100,...DOC_N]
cluster 2: [DOC_0,DOC_1,...DOC_N]

I have found a similar answer, but its output is:

cluster 1: [word,word...word]
cluster 2: [word,word...word]

How can I alter the code to get document clusters instead?

Cruces answered 8/9, 2016 at 13:4

So it looks like you're almost there.

You are outputting a set of vectors. For the sklearn package, you have to put those into a numpy array; building it with numpy.array() (or numpy.stack()) over the list of vectors would probably be best. The documentation for KMeans is really stellar, and it's good across the whole library.

A note for you: I have had much better luck with DBSCAN than with KMeans, both of which live in the same sklearn library. DBSCAN doesn't require you to specify the number of clusters up front.

There are well-commented code examples in both links.
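
To make that concrete, here is a minimal sketch of the whole path from the trained model to the output format in the question. It assumes the old gensim docvecs API that the question's code uses (model.docvecs with a doctags mapping; in gensim 4+ the equivalent lives on model.dv), and n_clusters=2 is an arbitrary choice:

import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

# pull every document tag and its learned vector out of the model
tags = list(model.docvecs.doctags.keys())   # e.g. ['DOCS_0', 'DOCS_1', ...]
vecs = np.array([model.docvecs[tag] for tag in tags])

# cluster the document vectors; n_clusters=2 is an arbitrary assumption
km = KMeans(n_clusters=2)
assignments = km.fit_predict(vecs)

# group tags by cluster id to get the desired output format
clusters = defaultdict(list)
for tag, cluster_id in zip(tags, assignments):
    clusters[cluster_id].append(tag)

for cluster_id, members in sorted(clusters.items()):
    print('cluster %d: %s' % (cluster_id + 1, members))

Swapping in DBSCAN() only changes the two clustering lines: there is no n_clusters argument, and a label of -1 means DBSCAN treated that document as noise.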

Goings answered 8/9, 2016 at 13:41

Comments:

Cool, I'll have a look. Did you implement document clustering? – Cruces

No, but I've worked with document classification. In general it gets harder the smaller the texts are; starting out, you might want to keep to bigger texts. – Goings

hdbscan.readthedocs.io/en/latest/… looks really interesting. – Goings

In my case I used:

import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# infer a vector for each document
doc_vecs = [model.infer_vector(doc.split()) for doc in docs]

# creating a matrix from the list of vectors
mat = np.stack(doc_vecs)

# Clustering with KMeans
km_model = KMeans(n_clusters=5)
km_model.fit(mat)
# Get cluster assignment labels
labels = km_model.labels_

# Clustering with DBSCAN
dbscan_model = DBSCAN()
labels = dbscan_model.fit_predict(mat)

Here model is the pre-trained Doc2Vec model. In my case I didn't need to cluster the training documents themselves, but new documents stored in the docs list.
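
For completeness, a hedged sketch of the pieces around that snippet (the file name new_docs.txt and the DOC_%d tag format are my own assumptions, not from the answer): docs can be a plain list of strings, and once labels has been computed, the documents can be grouped back into the cluster: [DOC_...] format the question asked for:

from collections import defaultdict

# assumption: build docs as one paragraph per line of a hypothetical file,
# before running the inference and clustering snippet above
with open('new_docs.txt') as fin:
    docs = [line.strip() for line in fin if line.strip()]

# after `labels` has been computed by either model above, group the
# documents by cluster label (DBSCAN uses -1 for noise points)
clusters = defaultdict(list)
for doc_no, label in enumerate(labels):
    clusters[label].append('DOC_%d' % doc_no)

for label, members in sorted(clusters.items()):
    print('cluster %s: %s' % (label, members))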

Enamel answered 5/6, 2018 at 11:2
