TF-IDF vectorizer to extract ngrams

2

7

How can I use the TF-IDF vectorizer from the scikit-learn library to extract unigrams and bigrams from tweets? I want to train a classifier on the output.

This is the example code from the scikit-learn documentation:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
Meridethmeridian answered 28/10, 2020 at 8:10 Comment(0)
4

TfidfVectorizer has an ngram_range parameter that determines the range of n-grams included as features in the final matrix. In your case, you want (1,2) to cover both unigrams and bigrams:

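import pandas as pd

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus).toarray()  # dense array, for display only

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X, columns=vectorizer.get_feature_names_out())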

        and  and this  document  document is     first  first document  \
0  0.000000  0.000000  0.314532     0.000000  0.388510        0.388510   
1  0.000000  0.000000  0.455513     0.356824  0.000000        0.000000   
2  0.357007  0.357007  0.000000     0.000000  0.000000        0.000000   
3  0.000000  0.000000  0.282940     0.000000  0.349487        0.349487   

         is    is the   is this       one  ...       the  the first  \
0  0.257151  0.314532  0.000000  0.000000  ...  0.257151   0.388510   
1  0.186206  0.227756  0.000000  0.000000  ...  0.186206   0.000000   
2  0.186301  0.227873  0.000000  0.357007  ...  0.186301   0.000000   
3  0.231322  0.000000  0.443279  0.000000  ...  0.231322   0.349487   
...
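Since the goal is to train a classifier, the dense DataFrame above is mainly useful for inspection. A minimal sketch of feeding the sparse TF-IDF output straight into a model follows. The tweets, the labels, and the choice of LogisticRegression are assumptions made for illustration and are not part of the question:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for labelled tweets (not from the question).
tweets = [
    "I love this phone",
    "what a great day",
    "I hate this phone",
    "what a terrible day",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative (made-up labels)

# Unigram + bigram TF-IDF features go directly into the classifier;
# the sparse matrix is never converted to a dense array.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(tweets, labels)

# Raw strings can be passed to predict(); the pipeline vectorizes them first.
predictions = model.predict(["I love this day"])
```

Wrapping both steps in a pipeline keeps the vocabulary learned at fit time tied to the classifier, so new text is transformed the same way at prediction time.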
Seemly answered 28/10, 2020 at 8:18 Comment(1)
Can I change ngrams from words to characters? - Meridethmeridian
3

According to the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

You specify the n-gram range when initializing TfidfVectorizer, via TfidfVectorizer(ngram_range=(min_n, max_n)). These are the lower and upper boundaries of the range of n-values for the n-grams to be extracted: an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

So the answer would be vectorizer = TfidfVectorizer(ngram_range=(1,2))

Genseric answered 28/10, 2020 at 8:21 Comment(0)
