Why does sklearn tf-idf vectorizer give the highest scores to stopwords?

I implemented tf-idf with sklearn for each category of the Brown corpus in the nltk library. There are 15 categories, and for each of them the highest score is assigned to a stopword.

The default parameter is use_idf=True, so idf is being applied. The corpus is big enough to produce meaningful scores. So I don't get it - why are stopwords assigned such high values?

import nltk, sklearn, numpy
import pandas as pd
from nltk.corpus import brown, stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('brown')
nltk.download('stopwords')

corpus = []
for c in brown.categories():
  doc = ' '.join(brown.words(categories=c))
  corpus.append(doc)

thisvectorizer = TfidfVectorizer()
X = thisvectorizer.fit_transform(corpus)
tfidf_matrix = X.toarray()
features = thisvectorizer.get_feature_names_out()

for array in tfidf_matrix:
  tfidf_per_doc = list(zip(features, array))
  tfidf_per_doc.sort(key=lambda x: x[1], reverse=True)
  print(tfidf_per_doc[:3])

The result is:

[('the', 0.6893251240111703), ('and', 0.31175508121108203), ('he', 0.24393467757919754)]
[('the', 0.6907757197452503), ('of', 0.4103688069243256), ('and', 0.28727742797362427)]
[('the', 0.7263025975051108), ('of', 0.3656242079748301), ('to', 0.291070574384772)]
[('the', 0.6754696081456901), ('and', 0.31548027033056486), ('to', 0.2688347676067454)]
[('the', 0.6814989142114783), ('of', 0.45275950370682505), ('and', 0.2884682701141856)]
[('the', 0.695577697455948), ('of', 0.35341130124782577), ('and', 0.31967658612871513)]
[('the', 0.6319718467602307), ('and', 0.3252073024670836), ('of', 0.31905971640910474)]
[('the', 0.7201346766200954), ('of', 0.4283480504712354), ('and', 0.2462470090388333)]
[('the', 0.7145625245362096), ('of', 0.3795569321959571), ('and', 0.2911711705971684)]
[('the', 0.6452744438258314), ('to', 0.2965331457609836), ('and', 0.29378534827130653)]
[('the', 0.7507413874270662), ('of', 0.3364825248186412), ('and', 0.25753131787795447)]
[('the', 0.6883038024694869), ('of', 0.41770049303087814), ('and', 0.2675503490244296)]
[('the', 0.6952456562438267), ('of', 0.39285038765440655), ('and', 0.34045082029960866)]
[('the', 0.5816391566950566), ('and', 0.3731049841274644), ('to', 0.2960718382909285)]
[('the', 0.6514884130485116), ('of', 0.29645876610367955), ('to', 0.2766347756651356)]

Every one of these top words is a stopword; roughly the first 15 words for each category are stopwords.

If I use the stop_words parameter with NLTK's built-in stopwords, the values look more or less fine. But this doesn't make sense to me - tf-idf should downgrade them by default, shouldn't it? Did I make a stupid mistake somewhere?

my_stop_words = stopwords.words('english')
thisvectorizer = TfidfVectorizer(stop_words=my_stop_words)

The result is:

[('said', 0.27925480211869536), ('would', 0.18907877226786665), ('man', 0.18520023334955144)]
[('one', 0.2904582969159082), ('would', 0.1989714323107254), ('new', 0.1394799739062623)]
[('would', 0.2225121466087311), ('one', 0.21533433542780428), ('new', 0.1603044497073654)]
[('would', 0.3015860042740072), ('said', 0.20105733618267146), ('one', 0.19691182409643082)]
[('state', 0.20994145654158766), ('year', 0.16516637619246616), ('fiscal', 0.1627693480477495)]
[('one', 0.27315617167196987), ('new', 0.1339515841852929), ('time', 0.12957408143413954)]
[('said', 0.25253824925464713), ('barco', 0.2297681382507305), ('one', 0.22671047376269457)]
[('af', 0.53260466412674), ('one', 0.2029977500545255), ('may', 0.12401317094240104)]
[('one', 0.29617565661385375), ('time', 0.15556701155475144), ('would', 0.14135656338388475)]
[('said', 0.22644107030344426), ('would', 0.2097909916046616), ('one', 0.1986909391388065)]
[('said', 0.2724277852935244), ('mrs', 0.19471476451838934), ('would', 0.1650670817295739)]
[('god', 0.2540052570261857), ('one', 0.18304020379411245), ('church', 0.17784155752544287)]
[('one', 0.2402151822472666), ('mr', 0.1854602509997279), ('new', 0.16073221753309752)]
[('said', 0.32053197885047946), ('would', 0.23918851593978377), ('could', 0.18980141345828996)]
[('helva', 0.34147320176374735), ('ekstrohm', 0.27116989551827), ('would', 0.2609130084842849)]
Kantar answered 2/1, 2022 at 14:57

Comment: What happens if you use my_stop_words = list(stopwords.words('english'))? – Arielle

Stopwords are assigned large values because of how your corpus is built, which skews the tf-idf calculation.

The shape of the matrix X is (15, 42396), meaning that you have only 15 documents and that these documents contain 42396 different words.

The mistake is that you are concatenating all the words of a given category into a single document, instead of keeping the texts as separate documents, in this snippet:

for c in brown.categories():
  doc = ' '.join(brown.words(categories=c))
  corpus.append(doc)

You can modify your code to:

for c in brown.categories():
    doc = [" ".join(x) for x in brown.sents(categories=c)]
    corpus.extend(doc)

which creates one entry per sentence, treating each sentence as its own document. Your X matrix will then have a shape of (57340, 42396).

This is really important, as the stopwords will now appear in most documents, which assigns them a very low tf-idf value.
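
If it helps to verify this, here is a minimal sketch (assuming the per-sentence corpus built above, and that 'the' and 'customer' are both in the vocabulary, as the outputs here suggest) that refits the vectorizer and compares the learned idf weights:

from sklearn.feature_extraction.text import TfidfVectorizer

thisvectorizer = TfidfVectorizer()
X = thisvectorizer.fit_transform(corpus)
print(X.shape)                      # roughly (57340, 42396) for the sentence-level corpus

idf = thisvectorizer.idf_           # learned idf weights (use_idf=True by default)
vocab = thisvectorizer.vocabulary_
print(idf[vocab['the']])            # small: 'the' occurs in most sentences
print(idf[vocab['customer']])       # much larger: 'customer' occurs in only a few sentences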

You can have a look at the highest-scoring terms (the top 24 non-zero tf-idf values here) with the following snippet:

import numpy as np

feature_names = thisvectorizer.get_feature_names_out()
# indices of the 24 largest non-zero tf-idf values across the whole matrix
sorted_nzs = np.argsort(X.data)[:-25:-1]
feature_names[X.indices[sorted_nzs]]

Output:

 array(['customer', 'asked', 'properties', 'itch', 'locked', 'achieving',
        'jack', 'guess', 'criticality', 'me', 'sir', 'beckworth', 'visa',
        'will', 'casey', 'athletics', 'norms', 'yeah', 'eh', 'oh', 'af',
        'currency', 'example', 'movies'], dtype=object)
Hrutkay answered 3/1, 2022 at 7:42

Comments:
Thank you! Hmm, but initially with 15 documents, stopwords (like "the") definitely were in each of those 15 documents - why did they have high values then? – Kantar
I actually had only 15 documents in the corpus on purpose - I want to compare the most important words for each category from the Brown corpus. – Kantar
The shape of the matrix is (2351, 36092), but I am still having this issue. The highest scores are assigned to the stop words. – Marshy

"The corpus is big enough...". Actually, in this case, it is the size of each document/text in the corpus that is big enough. The corpus's size is, however, only 15 documents (thus, N in idf would be 15). If you print brown.categories(), you'll see that the Brown corpus contains 15 categories, which are used as your documents. Having a small corpus means that some terms (such s stop words) will have the same distribution across documents in the corpus, and thus, will get penalised the same way by idf. If, for example, the word "customer" occurs just as "and" in a corpus (i.e., both appear in the same number of documents), their idf value will be the same; however, stop words (such as "and" above), due to their usually larger term frequency tf, they will be given higher tf-idf scores than words such as "customer"; which might appear in every document as well (as an example), but with lower term frequency.

However, the number of documents in the corpus is only part of the problem here, since tf-idf is indeed known to downgrade such frequently occurring terms while highlighting the ones that are frequent in one document and rare in all the others. The second probable cause is how sklearn's TfidfVectorizer (and hence TfidfTransformer) computes the tf-idf scores. As per the documentation, the idf is, by default, computed as idf(t) = log[(1 + n) / (1 + df(t))] + 1 (followed by cosine normalisation of each row), which differs from the standard textbook formula idf(t) = log[n / df(t)]. So, in a nutshell, one should use a large enough sample of documents when using tf-idf, and it may also be worth experimenting with the standard formula to see how it behaves. I recently posted an extended answer explaining this on a very similar question, showing that as the size of the corpus (i.e., the number of documents) increases, more stop words (or commonly occurring words in a corpus) are eliminated. Please have a look here.
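
Here is a minimal sketch on a toy corpus (not the Brown data) that checks sklearn's default idf against the formula quoted above. It shows why a word that appears in every document is never driven to zero by the default smoothed idf:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["the cat sat", "the dog ran", "the bird flew"]  # 'the' appears in every document
vec = TfidfVectorizer()                                # defaults: smooth_idf=True, norm='l2'
vec.fit(toy)

n = len(toy)
df = n                                            # document frequency of 'the'
expected = np.log((1 + n) / (1 + df)) + 1         # sklearn's smoothed idf -> 1.0
print(vec.idf_[vec.vocabulary_['the']], expected) # both print 1.0

# With the textbook formula log(n / df), the idf of 'the' would be 0 and the
# term would vanish regardless of its term frequency; under the default
# formula it keeps a weight of 1, so its high tf carries through.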

Okeechobee answered 18/1, 2022 at 14:38

Comment: The shape of the matrix is (2351, 36092), but I am still having this issue. The highest scores are assigned to the stop words. – Marshy
