How to get all documents per topic in BERTopic modeling

I have a dataset and am trying to group it into topics using BERTopic, but I can't get all the documents of a topic. BERTopic only returns 3 documents per topic.

from bertopic import BERTopic

topic_model = BERTopic(
    verbose=True,
    embedding_model=embedding_model,
    nr_topics="auto",
    n_gram_range=(3, 3),
    top_n_words=10,
    calculate_probabilities=True,
    seed_topic_list=topic_list,
)
topics, probs = topic_model.fit_transform(docs_test)
# Representative documents for topic 1
representative_doc = topic_model.get_representative_docs(1)
representative_doc

This topic contains more than 300 documents, but BERTopic only shows 3 of them with .get_representative_docs.

Heritable answered 27/10, 2021 at 14:52 Comment(0)

There are probably solutions that are more elegant because I am not an expert, but I can share what worked for me:

topics, probs = topic_model.fit_transform(docs_test)

returns the topic assigned to each document, in the same order as docs_test.

Therefore, you can combine this output and the documents. For example, combine them into a Pandas dataframe using:

import pandas as pd
df = pd.DataFrame({'topic': topics, 'document': docs_test})

Now you can filter this dataframe by topic to find the documents assigned to it:

topic_0 = df[df.topic == 0]
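
If you would rather collect every topic at once without going through pandas, the same pairing of topics and documents can be sketched in plain Python. The lists below are hypothetical stand-ins for the output and input of fit_transform, not real BERTopic results:

```python
from collections import defaultdict

# Hypothetical stand-ins: one topic id per document, in the same order.
topics = [0, 1, 0, -1, 1, 0]
docs_test = ["doc a", "doc b", "doc c", "doc d", "doc e", "doc f"]

# Map each topic id to the list of documents assigned to it.
# Topic -1 is the label BERTopic uses for outlier documents.
docs_by_topic = defaultdict(list)
for topic, doc in zip(topics, docs_test):
    docs_by_topic[topic].append(doc)

# Print each topic id with its document count and documents.
for topic in sorted(docs_by_topic):
    print(topic, len(docs_by_topic[topic]), docs_by_topic[topic])
```

This relies only on topics and docs_test having the same length and order, which is also what the dataframe approach assumes.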
Clino answered 24/11, 2021 at 14:19 Comment(0)

BERTopic provides get_document_info(), which returns a dataframe with one row per document and the topic assigned to it. https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.get_document_info

The returned dataframe looks like this:

index  Document   Topic  Name      ...
0      doc1_text  241    kw1_kw2_  ...
1      doc2_text  -1     kw1_kw2_  ...

You can use this dataframe to get all the documents associated with a particular topic, using pandas groupby or whatever approach you prefer.

T = topic_model.get_document_info(docs)
docs_per_topics = T.groupby(["Topic"]).apply(lambda x: x.index).to_dict()

This returns a dictionary that maps each topic to the row indices of its documents:

{
    -1: Int64Index([3,10,11,12,15,16,18,19,20,22,...365000], dtype='int64',length=149232),
    0: Int64Index([907,1281,1335,1337,...308420,308560,308645],dtype='int64',length=5127),
    ...
}
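
To go from row indices to the document texts themselves, a minimal sketch along the same lines follows. The dataframe here is a hypothetical stand-in for the result of get_document_info(), built by hand with only the Document and Topic columns shown above, so the values are illustrative:

```python
import pandas as pd

# Hypothetical stand-in for topic_model.get_document_info(docs);
# only the "Document" and "Topic" columns are reproduced here.
doc_info = pd.DataFrame({
    "Document": ["doc1_text", "doc2_text", "doc3_text", "doc4_text"],
    "Topic": [241, -1, 241, 0],
})

# Row indices per topic, as plain Python lists.
indices_per_topic = {
    topic: list(index)
    for topic, index in doc_info.groupby("Topic").groups.items()
}

# Document texts per topic.
texts_per_topic = doc_info.groupby("Topic")["Document"].apply(list).to_dict()
```

With a real model you would replace the hand-built dataframe with the get_document_info() result; texts_per_topic[241] then holds every document assigned to topic 241.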
Buote answered 15/2, 2023 at 18:12 Comment(1)
This question dates from when BERTopic had no such API. Earlier versions had this problem, but now it is quite easy. - Heritable