Does gensim.corpora.Dictionary have term frequency saved?

From gensim.corpora.Dictionary, it's possible to get the document frequency of the words (i.e. how many documents a particular word occurred in):

from nltk.corpus import brown
from gensim.corpora import Dictionary

documents = brown.sents()
brown_dict = Dictionary(documents)

# The 100th word in the dictionary: 'these'
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents')

[out]:

The word "these" appears in 1213 documents

And there is the filter_n_most_frequent(remove_n) function, which removes the n most frequent tokens:

filter_n_most_frequent(remove_n) Filter out the ‘remove_n’ most frequent tokens that appear in the documents.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!

Is the filter_n_most_frequent function removing the n most frequent tokens based on document frequency or term frequency?

If it's the latter, is there some way to access the term frequency of the words in the gensim.corpora.Dictionary object?

Mackie answered 11/10, 2017 at 9:37 Comment(0)
B
8

No, gensim.corpora.Dictionary does not save term frequency. You can see the source code here. The class only stores the following member variables:

    self.token2id = {}  # token -> tokenId
    self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
    self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared

    self.num_docs = 0  # number of documents processed
    self.num_pos = 0  # total number of corpus positions
    self.num_nnz = 0  # total number of non-zeroes in the BOW matrix

This means everything in the class defines frequency as document frequency, never term frequency, as the latter is never stored globally. This applies to filter_n_most_frequent(remove_n) as well as every other method.
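To make the distinction concrete, here is a minimal sketch in plain Python (no gensim involved) that computes both quantities for a toy corpus. The documents and the names term_freq and doc_freq are made up for illustration; they are not part of gensim.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical data, not the Brown corpus)
docs = [
    ["apple", "apple", "banana"],
    ["apple", "cherry"],
    ["banana", "banana", "banana"],
]

term_freq = Counter()  # total occurrences of each token across all documents
doc_freq = Counter()   # number of documents that contain each token

for doc in docs:
    term_freq.update(doc)       # counts every occurrence
    doc_freq.update(set(doc))   # counts each token at most once per document

print("banana: term frequency", term_freq["banana"],
      "document frequency", doc_freq["banana"])
```

Here "banana" occurs four times in total but in only two documents, so ranking tokens by one measure can give a different order than ranking by the other.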

Bowel answered 17/10, 2017 at 5:51 Comment(0)
B
4

I had the same simple question. At first it appeared that the frequency of the word was hidden and not accessible in the object. I'm not sure why; it makes testing and validation a pain. What I did was export the dictionary as text:

dictionary.save_as_text('c:\\research\\gensimDictionary.txt')

That text file has three columns. For example, here are the rows for the words "summit", "summon" and "sumo":

Key     Word    Frequency
10      summit  1227
3658    summon  118
8477    sumo    40

I then found a solution: the .cfs attribute holds the word frequencies. See https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary

print(str(dictionary[10]), str(dictionary.cfs[10])) 

summit 1227

simple

Bolton answered 2/2, 2020 at 15:15 Comment(0)
B
4

gensim.corpora.Dictionary now has term frequency stored in its cfs attribute. You can see the documentation here.

cfs
Collection frequencies: token_id -> how many instances of this token are contained in the documents.
Type: dict of (int, int)

Boylan answered 27/4, 2021 at 16:16 Comment(0)
S
2

Could you do something like this?

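import pandas as pd
from gensim import corpora

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(sent) for sent in documents]
# Sum the per-document counts; the resulting Series is indexed by token id
tf_by_id = pd.DataFrame([dict(bow) for bow in corpus]).sum(axis=0)
vocab = [dictionary[i] for i in tf_by_id.index]  # terms, in the same order as vocab_tf
vocab_tf = list(tf_by_id)  # list of term frequencies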
Suppress answered 28/12, 2017 at 17:1 Comment(0)
D
0

Dictionary does not have it, but corpus does.

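from gensim import corpora

# Term frequency
# load dictionary
dictionary = corpora.Dictionary.load('YourDict.dict')
# load corpus
corpus = corpora.MmCorpus('YourCorpus.mm')
# per-document (term, frequency) pairs; documents differ in length, so keep a plain list
CorpusTermFrequency = [[(dictionary[token_id], freq) for token_id, freq in doc] for doc in corpus]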
Duodiode answered 23/5, 2018 at 13:46 Comment(0)
C
0

One efficient way to calculate term frequency is to work directly from the bag-of-words representation rather than creating dense vectors:

corpus = [dictionary.doc2bow(sent) for sent in documents]
vocab_tf = {}
for bow in corpus:
    for token_id, count in bow:
        # accumulate each token's count across all documents
        vocab_tf[token_id] = vocab_tf.get(token_id, 0) + count
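The same accumulation can also be sketched with collections.Counter, whose update method adds counts when given a mapping. The corpus below is hand-written in the shape that doc2bow returns (a list of (token_id, count) pairs per document), so the ids and counts are hypothetical.

```python
from collections import Counter

# Hypothetical bag-of-words corpus: one list of (token_id, count) pairs per document
corpus = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 1)],
    [(1, 3)],
]

vocab_tf = Counter()
for bow in corpus:
    # Counter.update with a mapping adds each count to the running total for that id
    vocab_tf.update(dict(bow))

print(dict(vocab_tf))
```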
Cockney answered 28/8, 2018 at 11:0 Comment(0)
