How can I check the tokens that TfidfVectorizer() produces internally? If I don't pass any arguments, TfidfVectorizer() tokenizes each string with its default settings. I want to see how it tokenizes the strings so that I can tune my model more easily.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
I want something like this:
>>> vectorizer.get_processed_tokens()
[['this', 'is', 'first', 'document'],
 ['this', 'document', 'is', 'second', 'document'],
 ['this', 'is', 'the', 'third', 'one'],
 ['is', 'this', 'the', 'first', 'document']]
How can I do this?
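One possible direction, offered as a sketch rather than a confirmed answer: TfidfVectorizer has a build_analyzer() method that returns the callable it uses to preprocess and tokenize a single document, so applying that callable to each document should show the tokens. The names analyzer and tokens below are only illustrative, and the sketch assumes the default settings (word analyzer, no stop words, unigrams only).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

vectorizer = TfidfVectorizer()

# build_analyzer() returns a function that applies the vectorizer's
# preprocessing (lowercasing by default) and tokenization to one string.
# It does not require the vectorizer to be fitted first.
analyzer = vectorizer.build_analyzer()

# One token list per document, in corpus order.
tokens = [analyzer(doc) for doc in corpus]

for doc_tokens in tokens:
    print(doc_tokens)
```

With the defaults, the tokens come back lowercased, punctuation is dropped, and single-character tokens are skipped. If the vectorizer is created with other arguments such as stop_words or ngram_range, the analyzer should reflect those as well.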