Does gensim.corpora.Dictionary have term frequency saved?
From gensim.corpora.Dictionary, it's possible to get the document frequency of the words (i.e. how many documents a particular word occurs in):
from nltk.corpus import brown
from gensim.corpora import Dictionary
documents = brown.sents()
brown_dict = Dictionary(documents)
# The 100th word in the dictionary: 'these'
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents')
[out]:
The word "these" appears in 1213 documents
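To make the distinction between document frequency and term frequency concrete, here is a minimal sketch on a toy corpus using only the standard library. The `toy_docs` data and the names `term_freq` and `doc_freq` are made up for illustration and are not part of gensim:

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [
    ["a", "a", "a", "a"],
    ["b", "c"],
    ["b", "d"],
]

# Term frequency: total number of occurrences across all documents.
term_freq = Counter(token for doc in toy_docs for token in doc)

# Document frequency: number of documents that contain the token at least once.
doc_freq = Counter(token for doc in toy_docs for token in set(doc))

print("a:", term_freq["a"], doc_freq["a"])
print("b:", term_freq["b"], doc_freq["b"])
```

In this toy corpus "a" ranks first by term frequency while "b" ranks first by document frequency, so the two measures can disagree about which token is the "most frequent".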
There is also the filter_n_most_frequent(remove_n) function, which removes the n most frequent tokens:
filter_n_most_frequent(remove_n)
Filter out the ‘remove_n’ most frequent tokens that appear in the documents. After the pruning, shrink resulting gaps in word ids.
Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!
Does the filter_n_most_frequent function remove the n most frequent tokens based on document frequency or term frequency?
If it's the latter, is there some way to access the term frequency of the words in the gensim.corpora.Dictionary object?