@Aron's and @Roko Mijic's approaches neglect the fact that the function show_topics
returns by default the top 20 words of each topic only. If one returns all the words that compose a topic, all the approximated topic probabilities in that case will be 1 (or 0.999999). I experimented with the following code, which is an adaptation of @Roko Mijic's:
def topic_prob_extractor(gensim_hdp, t=-1, w=25, isSorted=True):
"""
Input the gensim model to get the rough topics' probabilities
"""
shown_topics = gensim_hdp.show_topics(num_topics=t, num_words=w ,formatted=False)
topics_nos = [x[0] for x in shown_topics ]
weights = [ sum([item[1] for item in shown_topics[topicN][1]]) for topicN in topics_nos ]
if (isSorted):
return pd.DataFrame({'topic_id' : topics_nos, 'weight' : weights}).sort_values(by = "weight", ascending=False);
else:
return pd.DataFrame({'topic_id' : topics_nos, 'weight' : weights});
A better, yet I'm not sure if 100% valid, approach is the one mentioned here. You can get the topics' true weights (alpha vector) of the HDP model as:
alpha = hdpModel.hdp_to_lda()[0];
Examining the topics' equivalent alpha values is more logical than tallying up the weights of the first 20 words of each topic to approximate its probability of usage in the data.