How do I calculate the coherence score of an sklearn LDA model?
Here, best_lda_model is an sklearn-based LDA model, and we are trying to find a coherence score for it:

coherence_model_lda = CoherenceModel(model = best_lda_model,texts=data_vectorized, dictionary=dictionary,coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\n Coherence Score :',coherence_lda)

Output: the error below pops up because I'm trying to find the coherence score of an sklearn LDA topic model. Is there a way around it? Also, what metric does sklearn's LDA use to group these words together?

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in _get_topics_from_model(model, topn)
   490                 matutils.argsort(topic, topn=topn, reverse=True) for topic in
--> 491                 model.get_topics()
   492             ]

AttributeError: 'LatentDirichletAllocation' object has no attribute 'get_topics'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-106-ce8558d82330> in <module>
----> 1 coherence_model_lda = CoherenceModel(model = best_lda_model,texts=data_vectorized, dictionary=dictionary,coherence='c_v')
     2 coherence_lda = coherence_model_lda.get_coherence()
     3 print('\n Coherence Score :',coherence_lda)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in __init__(self, model, topics, texts, corpus, dictionary, window_size, keyed_vectors, coherence, topn, processes)
   210         self._accumulator = None
   211         self._topics = None
--> 212         self.topics = topics
   213 
   214         self.processes = processes if processes >= 1 else max(1, mp.cpu_count() - 1)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in topics(self, topics)
   433                     self.model)
   434         elif self.model is not None:
--> 435             new_topics = self._get_topics()
   436             logger.debug("Setting topics to those of the model: %s", self.model)
   437         else:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in _get_topics(self)
   467     def _get_topics(self):
   468         """Internal helper function to return topics from a trained topic model."""
--> 469         return self._get_topics_from_model(self.model, self.topn)
   470 
   471     @staticmethod

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in _get_topics_from_model(model, topn)
   493         except AttributeError:
   494             raise ValueError(
--> 495                 "This topic model is not currently supported. Supported topic models"
   496                 " should implement the `get_topics` method.")
   497 

ValueError: This topic model is not currently supported. Supported topic models should implement the `get_topics` method.
Gotthard answered 10/3, 2020 at 8:3 Comment(0)
You could use tmtoolkit to compute each of the four coherence scores provided by gensim's CoherenceModel. The documentation states that the method tmtoolkit.topicmod.evaluate.metric_coherence_gensim "also supports models from lda and sklearn (by passing topic_word_distrib, dtm and vocab)!".

So, to get, for example, the 'c_v' coherence metric:

import numpy as np
from tmtoolkit.topicmod.evaluate import metric_coherence_gensim

# lda_model - a fitted sklearn LatentDirichletAllocation()
# vect      - the fitted CountVectorizer()
# dtm_tf    - the document-term matrix from vect.fit_transform(...)
# texts     - the list of tokenized documents
metric_coherence_gensim(measure='c_v',
                        top_n=25,
                        topic_word_distrib=lda_model.components_,
                        dtm=dtm_tf,
                        # sort the vocabulary by index so it lines up with the
                        # columns of the document-term matrix (dict key order
                        # of vocabulary_ is not guaranteed to match)
                        vocab=np.array(sorted(vect.vocabulary_, key=vect.vocabulary_.get)),
                        texts=train['cleaned_NOUN'].values)

Regarding the second part of the question: as far as I know, perplexity (which is often not well aligned with human perception) is the native evaluation method for sklearn's LDA implementation.
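For illustration, here is a minimal, self-contained sketch of sklearn's built-in evaluation for LDA (the toy documents below are made up; score() and perplexity() are the actual sklearn methods):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, purely for illustration
docs = ["apple banana apple", "banana fruit salad", "dog cat dog", "cat pet dog"]
vect = CountVectorizer()
dtm = vect.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

print(lda.score(dtm))       # approximate log-likelihood (higher is better)
print(lda.perplexity(dtm))  # perplexity (lower is better)
```

Note that neither of these is a coherence measure; they evaluate how well the model fits the word counts, not how interpretable the topics are.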

Preliminary answered 2/6, 2020 at 12:17 Comment(1)
argument of type 'FakedGensimDict' is not iterable. I am getting this error. Is there a bug or am I using it wrong? – Cowardly
I made the following function, which takes sklearn's LDA model and a column of texts as arguments and returns the c_v coherence.

from gensim.models import CoherenceModel
import gensim.corpora as corpora

def get_Cv(model, df_column):
  topics = model.components_

  n_top_words = 20
  texts = [doc.split() for doc in df_column]

  # Create a gensim dictionary from the tokenized texts
  dictionary = corpora.Dictionary(texts)

  # Create a gensim bag-of-words corpus from the tokenized texts
  corpus = [dictionary.doc2bow(text) for text in texts]

  # Caveat: this assumes the dictionary's indices line up with the columns of
  # model.components_; strictly, the feature names should come from the fitted
  # CountVectorizer used to train the sklearn model
  feature_names = [dictionary[i] for i in range(len(dictionary))]

  # Get the top words for each topic from the components_ attribute
  top_words = []
  for topic in topics:
      top_words.append([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])

  coherence_model = CoherenceModel(topics=top_words, texts=texts, dictionary=dictionary, coherence='c_v')
  coherence = coherence_model.get_coherence()
  return coherence
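As a side note, the argsort slicing used above to pick the top words can be illustrated with a toy topic (the words and weights below are made up):

```python
import numpy as np

feature_names = ["apple", "banana", "cat", "dog", "fruit"]
topic = np.array([0.1, 0.5, 0.05, 0.3, 0.05])  # toy topic-word weights

n_top_words = 3
# Indices of the n_top_words largest weights, in descending order:
# argsort() sorts ascending, so we take the last n entries reversed
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)  # ['banana', 'dog', 'apple']
```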
Discuss answered 26/1, 2023 at 16:27 Comment(1)
Thanks for this! This is preferable to the tmtoolkit implementation for me, as I have a very large set of documents, and while tmtoolkit allows for parallel model fitting, scikit-learn lets me multi-process a single model, which is way better! The only question I have: if we build texts as shown, that doesn't account for the terms dropped by CountVectorizer via args like max_features; any thoughts on accounting for that? – Jeremy