Finding topics of an unseen document via Gensim

I am using Gensim for large-scale topic modeling, and I am having difficulty understanding how to determine the predicted topics for an unseen (non-indexed) document. For example, I have 25 million documents that I have converted to vectors in LSA (and LDA) space. I now want to figure out the topics of a new document; let's call it x.

According to the Gensim documentation, I can use:

topics = lsi[doc(x)]

where doc(x) is a function that converts x into a vector.

The problem, however, is that the variable topics holds a vector. The vector is useful if I am comparing x to other documents, because it lets me compute the cosine similarity between them, but I am unable to get the specific words that are associated with x itself.

Am I missing something, or does Gensim not have this capability?

Thank you,

EDIT

Larsmans has the answer.

I was able to show the topics by using:

for t in topics:
    print lsi.show_topics(t[0])
Ahola answered 13/7, 2012 at 13:22 Comment(1)
Please could you share how you are converting x to a vector? Many thanks! – Doone

The vector returned by indexing ([]) an LSI model is actually a list of (topic, weight) pairs. You can inspect a topic by means of the method LsiModel.show_topic.

Noto answered 13/7, 2012 at 15:36 Comment(1)
Ah! That was my problem: I was operating under the assumption that lsi[doc] was a vector. I had seen the show_topics method but didn't think it applied. Thank you for your help. – Ahola

I was able to show the topics by using:

for t in topics:
    print lsi.show_topics(t[0])

Just wanted to point out a tiny, but important, bug in your solution code: you need to use the show_topic() function rather than the show_topics() function (note the trailing s).

P.S. I know this should be posted as a comment rather than an answer, but my current reputation score does not allow comments just yet!

Kurth answered 17/5, 2014 at 16:43 Comment(0)
