LDA model generates different topics every time I train on the same corpus

I am using Python's gensim to train a Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics.

Why do the same LDA parameters and corpus generate different topics every time?

And how do I stabilize the topic generation?

I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj), and here's my code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
from collections import defaultdict
import codecs, os, glob, math

stopwords = [i.strip() for i in codecs.open('stopmild','r','utf8').readlines() if i[0] != "#" and i != ""]

def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]

    # Group topics with similar words together.
    tops = set(lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" + "):
            top.append((t.split("*")[0], t.split("*")[1]))
        top_clusters.append(top)

    # Generate word only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))

    return lda, corpus_lda, top_clusters, top_wordonly

####################################################################### 

# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all","r","utf8"):
    lemma = line.split("\t")[3]
    documents.append(lemma)
texts = [[word for word in document.lower().split() if word not in stopwords]
             for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)

for i in topic_wordonly:
    print i
Rodl answered 25/2, 2013 at 13:8 Comment(0)

Why do the same LDA parameters and corpus generate different topics every time?

Because LDA uses randomness in both training and inference steps.

And how do I stabilize the topic generation?

By resetting the numpy.random seed to the same value every time a model is trained or inference is performed, with numpy.random.seed:

import numpy as np

SOME_FIXED_SEED = 42

# before training/inference:
np.random.seed(SOME_FIXED_SEED)

(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)
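For example, here is a minimal, self-contained sketch with a hypothetical toy corpus (not the asker's data). It assumes an older Gensim whose LdaModel draws from the global NumPy RNG; newer versions expose a random_state parameter instead (see the answer below):

import numpy as np
from gensim import corpora
from gensim.models import ldamodel

SOME_FIXED_SEED = 42

# Hypothetical toy documents, just for illustration
texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"],
         ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

np.random.seed(SOME_FIXED_SEED)  # reset the global seed before every training run
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2)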

Nynorsk answered 25/2, 2013 at 14:44 Comment(8)
If the training data is sufficient, the result should converge within a limited number of passes, shouldn't it?Beeman
May I know how do I set the numpy.random to numpy.random.seed? Could you show me an example of how to call the ldamodel with numpy.random.seed?Rodl
@2er0 You don't set np.random to np.random.seed, you set the seed with np.random.seed.Nynorsk
@larsmans, so it's just np.random.seed(x)? Could you show an example of how to set the seed with np.random.seed?Rodl
@2er0: that's in fact it. Your edit to my answer was rejected by other users, but I reconstructed it more or less.Nynorsk
Thanks for the clarification =). BTW, the fixed random seed actually improved my system's performance in topic inference. Since the number of documents is relatively small, I train the model with 20-50 passes and 10 topics for 50-200 documents, with random.seed(10).Rodl
@2er0: sheer luck :) It happens sometimes with randomized training algorithms; trying lots of random seeds until you hit the right model (as tested on a validation set) is a pretty standard trick.Nynorsk
I am using LDA to find the topic coverages of a document. My solution has been to generate many models and run the documents through them and then take the average of the results. If you set the random seed then you will only get one version of the model and although that model will be reproducible, how do you know it's the correct one?Uticas

Set the random_state parameter when initializing LdaModel():

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            random_state=1,
                                            passes=num_passes,
                                            alpha='auto')
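
As a quick sanity check (a sketch with a hypothetical toy corpus, not the asker's data): two models trained over the same corpus with the same random_state should produce identical topic-term matrices.

import numpy as np
from gensim import corpora
from gensim.models import ldamodel

# Hypothetical toy corpus, just to demonstrate reproducibility
texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"],
         ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda_a = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=1)
lda_b = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=1)
assert np.allclose(lda_a.get_topics(), lda_b.get_topics())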
Amhara answered 25/6, 2018 at 5:57 Comment(0)

I had the same problem, even with about 50,000 comments. But you can get much more consistent topics by increasing the number of iterations LDA runs for. It defaults to 50, and when I raise it to 300 it usually gives me the same results, probably because it is much closer to convergence.

Specifically, you just add the following option:

ldamodel.LdaModel(corpus, ..., iterations=<your desired iterations>)
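
For instance, a sketch that assumes the corpus and dictionary built in the question's code (the passes value here is an illustrative choice, not from this answer):

from gensim.models import ldamodel

# Assumes `corpus` and `dictionary` from the question's code above
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50,
                        iterations=300,  # default is 50; more per-document inference iterations
                        passes=10)       # illustrative: extra full passes over the corpus also help convergence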
Carliecarlile answered 21/1, 2017 at 3:56 Comment(1)
The question is about the randomness of the results that are generated for different runs with the same number of iterations.Amhara

This is due to the probabilistic nature of LDA, as noted by others. However, I don't believe setting the random_state argument to a fixed number is the proper solution.

Definitely try increasing the number of iterations first to make sure your algorithm is converging. Even then, each starting point may land you on a different local minimum. So you can run LDA multiple times without setting random_state, and then compare the results using the coherence score of each model. This helps you avoid suboptimal local minima.

Gensim's CoherenceModel already has the most common coherence metrics implemented for you, such as c_v, u_mass, and c_npmi.

You might find that these steps make the results more stable, but they won't actually guarantee the same results from run to run. However, IMO it's better to get as close to the global optimum as possible than to be stuck on the same local minimum because of a fixed random_state.
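
A minimal sketch of that multi-run approach, assuming the tokenized texts, dictionary, and corpus from the question; the number of runs and passes are arbitrary illustrative values:

from gensim.models import ldamodel
from gensim.models.coherencemodel import CoherenceModel

best_lda, best_score = None, float("-inf")
for run in range(5):  # each run starts from a different random initialization
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50, passes=20)
    score = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                           coherence='c_v').get_coherence()
    if score > best_score:  # keep the model with the best c_v coherence
        best_lda, best_score = lda, score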

Curious answered 12/3, 2020 at 20:9 Comment(0)
