Understanding the `ngram_range` argument in a CountVectorizer in sklearn

I'm a little confused about how to use n-grams in the scikit-learn library in Python, specifically how the ngram_range argument of CountVectorizer works.

Running this code:

from sklearn.feature_extraction.text import CountVectorizer
vocabulary = ['hi ', 'bye', 'run away']
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
print cv.vocabulary_

gives me:

{'hi ': 0, 'bye': 1, 'run away': 2}

Where I was under the (obviously mistaken) impression that I would get unigrams and bigrams, like this:

{'hi ': 0, 'bye': 1, 'run away': 2, 'run': 3, 'away': 4}

I am working with the documentation here: http://scikit-learn.org/stable/modules/feature_extraction.html

Clearly there is something terribly wrong with my understanding of how to use ngrams. Perhaps the argument is having no effect or I have some conceptual issue with what an actual bigram is! I'm stumped. If anyone has a word of advice to throw my way, I'd be grateful.

UPDATE:
I have realized the folly of my ways. I was under the impression that the ngram_range would affect the vocabulary, not the corpus.

Bullnecked answered 3/6, 2014 at 1:27 Comment(0)

Setting the vocabulary explicitly means no vocabulary is learned from data. If you don't set it, you get:

>>> from pprint import pprint
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer(ngram_range=(1, 2))
>>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
{u'an': 0,
 u'an apple': 1,
 u'apple': 2,
 u'apple day': 3,
 u'away': 4,
 u'day': 5,
 u'day keeps': 6,
 u'doctor': 7,
 u'doctor away': 8,
 u'keeps': 9,
 u'keeps the': 10,
 u'the': 11,
 u'the doctor': 12}

An explicit vocabulary restricts the terms that will be extracted from text; the vocabulary is not changed:

>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]])  # unigram and bigram found

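(Note that tokenization, and stop-word filtering when it is enabled, is applied before n-gram extraction. With the default settings no stop words are removed, but the default token_pattern only keeps tokens of two or more characters, so "a" is dropped, hence "apple day".)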
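To see where the "apple day" bigram comes from, one option is to call the analyzer directly. The following is a minimal sketch, assuming a reasonably recent scikit-learn with default settings; the alternative `token_pattern` is only an illustrative assumption and was not part of the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "an apple a day keeps the doctor away"

# Default settings: the default token_pattern keeps only tokens made of
# two or more word characters, so the one-letter token "a" is discarded
# before any n-grams are built.
default_ngrams = CountVectorizer(ngram_range=(1, 2)).build_analyzer()(text)

# Illustrative alternative pattern that also keeps one-character tokens.
keep_short_ngrams = CountVectorizer(
    ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b"
).build_analyzer()(text)

print(default_ngrams)      # includes "apple day", has no "a"
print(keep_short_ngrams)   # includes "apple a" and "a day" instead
```

Because n-grams are built from whatever tokens survive tokenization, a dropped token silently joins its two neighbours into one bigram.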

Lodi answered 3/6, 2014 at 2:8 Comment(3)
So in your answer, you have fit on data, thus, you get the ngram_range you specified in the CountVectorizer. My solution then will be to manually vectorize my vocabulary to include 2-grams beforehand... unless you recommend any other methods? - Bullnecked
@MattO'Brien What exactly are you trying to achieve? - Lodi
My goal is to simply use a CountVectorizer to count how many times tokens appear in a corpus. I have a custom vocabulary, consisting of many different length grams (1, 2, 3, 4). I have been using unigrams but I want to explore counts of other length tokens as well. - Bullnecked
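For the goal described in the last comment, one possible approach is to keep the custom vocabulary as it is and widen ngram_range so that it covers the longest entry. This is a rough sketch, not something confirmed in the thread: the vocabulary and corpus are made-up examples, and it assumes every entry is written the way the default analyzer emits terms (lowercase tokens of two or more characters, joined by single spaces, no trailing whitespace).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical custom vocabulary mixing 1-, 2- and 4-grams.
vocabulary = ["away", "run away", "do not run away"]

# Hypothetical corpus, for illustration only.
corpus = ["Do not run away", "Run away, run away!"]

# ngram_range must span every n-gram length present in the vocabulary.
cv_1_to_4 = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 4))
counts_1_to_4 = cv_1_to_4.fit_transform(corpus).toarray().tolist()

# With unigrams only, the multi-word entries can never be matched.
cv_unigrams = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 1))
counts_unigrams_only = cv_unigrams.fit_transform(corpus).toarray().tolist()

# Columns follow the order of the vocabulary list.
print(counts_1_to_4)         # [[1, 1, 1], [2, 2, 0]]
print(counts_unigrams_only)  # [[1, 0, 0], [2, 0, 0]]
```

An entry such as 'hi ' with a trailing space would stay in the vocabulary but would always count zero, since the analyzer never produces a term in that form.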
