Remove single occurrences of words in vocabulary TF-IDF

I am attempting to remove words that occur once in my vocabulary to reduce my vocabulary size. I am using the sklearn TfidfVectorizer() and then the fit_transform function on my data frame.

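from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
# astype('U') casts every cell to a unicode string before vectorizing
tfs = tfidf.fit_transform(df['original_post'].values.astype('U'))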

My first thought was to use the preprocessor argument of the TfidfVectorizer, or the sklearn preprocessing package, before doing any machine learning.

Any tips or links to further implementation?

Publicity answered 22/8, 2017 at 5:32

You are looking for the min_df parameter (minimum document frequency). From the scikit-learn documentation of TfidfVectorizer:

min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

# remove words that appear in fewer than 5 documents
tfidf = TfidfVectorizer(min_df=5)
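
For the case in the question (words seen only once), min_df=2 is probably the closest setting. Here is a minimal sketch on a made-up three-sentence corpus (the corpus and the variable names are only for illustration), comparing the vocabulary with and without the cut-off:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this example.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Default min_df=1 keeps every term the tokenizer produces.
full_vocab = sorted(TfidfVectorizer().fit(corpus).vocabulary_)

# min_df=2 keeps only terms that appear in at least two documents.
pruned_vocab = sorted(TfidfVectorizer(min_df=2).fit(corpus).vocabulary_)

print(full_vocab)
print(pruned_vocab)
```

Note that the threshold counts documents, not occurrences: "chased", "log" and "mat" are dropped because each appears in a single document.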

You can also remove common words:

# remove words occurring in more than half the documents
tfidf = TfidfVectorizer(max_df=0.5)

You can also remove stop words like this:

tfidf = TfidfVectorizer(stop_words='english')
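
The three options can be combined in one vectorizer. A rough sketch on an invented toy corpus follows; the max_df value of 0.9 is an arbitrary choice for this tiny example, and it assumes the built-in English stop list contains "the" and "on":

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this example.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# max_df=0.9 drops terms found in more than 90% of the documents.
no_common = sorted(TfidfVectorizer(max_df=0.9).fit(corpus).vocabulary_)

# Stop words are removed first, then the document-frequency limits apply.
combined = sorted(
    TfidfVectorizer(min_df=2, max_df=0.9, stop_words='english')
    .fit(corpus)
    .vocabulary_
)

print(no_common)
print(combined)
```

On real data the thresholds would need tuning; a float max_df is a proportion of documents, while an int is an absolute document count.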
Everard answered 22/8, 2017 at 5:44

ShmulikA's answer will most likely work well, but it removes words based on document frequency, not on total count. For example, a word that occurs 200 times but in only 1 document will still be removed. The TF-IDF vectorizer does not provide exactly what you want. You would have to:

  1. Fit the vectorizer to your corpus and extract the complete vocabulary from it.
  2. Take the words as keys in a new dictionary.
  3. Count every word occurrence:

for document in corpus:
    for word in document:
        vocabulary[word] += 1

Now find the entries whose value is 1 and drop them from the dictionary. Put the remaining keys into a list and pass that list as the vocabulary parameter of the TF-IDF vectorizer.
This needs a lot of looping, so it may be simpler to just use min_df, which works well in practice.
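
The steps above could be sketched roughly as follows. This is only one possible way to do it: the toy corpus and the variable names are made up, and it uses a Counter with the vectorizer's own analyzer (from build_analyzer()) instead of a hand-built dictionary, so that the counts use the same tokenization as the vocabulary.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this example.
corpus = [
    "spam spam spam eggs",
    "eggs and ham",
    "ham and toast",
]

# Reuse the vectorizer's tokenizer so counts line up with its vocabulary.
analyzer = TfidfVectorizer().build_analyzer()

# Count every occurrence of every term across the whole corpus.
total_counts = Counter()
for document in corpus:
    total_counts.update(analyzer(document))

# Keep only the terms seen more than once in total.
kept_terms = sorted(term for term, n in total_counts.items() if n > 1)

# Restrict the TF-IDF vectorizer to the surviving terms.
tfidf = TfidfVectorizer(vocabulary=kept_terms)
tfs = tfidf.fit_transform(corpus)

print(kept_terms)
print(tfs.shape)
```

Here "spam" survives because it occurs three times, even though it is in a single document, whereas min_df=2 would drop it. On a large corpus the Python loop may be slow; summing the columns of a CountVectorizer matrix should give the same totals.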

Baptistry answered 30/8, 2019 at 20:36
