I'm a little confused about how to use ngrams in the scikit-learn library in Python, specifically, how the ngram_range argument works in a CountVectorizer.
Running this code:
from sklearn.feature_extraction.text import CountVectorizer
vocabulary = ['hi ', 'bye', 'run away']
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
print(cv.vocabulary_)
gives me:
{'hi ': 0, 'bye': 1, 'run away': 2}
I was under the (obviously mistaken) impression that I would get unigrams and bigrams, like this:
{'hi ': 0, 'bye': 1, 'run away': 2, 'run': 3, 'away': 4}
I am working with the documentation here: http://scikit-learn.org/stable/modules/feature_extraction.html
Clearly there is something terribly wrong with my understanding of how to use ngrams. Perhaps the argument is having no effect or I have some conceptual issue with what an actual bigram is! I'm stumped. If anyone has a word of advice to throw my way, I'd be grateful.
UPDATE:
I have realized the folly of my ways. I was under the impression that the ngram_range would affect the vocabulary, not the corpus.
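To make that concrete, here is a minimal sketch of the difference, assuming a reasonably recent scikit-learn. The two-document corpus is made up for illustration, and I use 'hi' without the trailing space from my original list, since the default tokenizer never produces a term that ends in whitespace. When no vocabulary is passed, the n-grams extracted from the corpus become the vocabulary. When a vocabulary is passed, ngram_range only controls which n-grams are extracted from each document and then looked up in it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, made up for illustration.
corpus = ["hi there, run away", "bye bye"]

# No vocabulary given: the unigrams and bigrams found in the corpus
# become the vocabulary.
learned = CountVectorizer(ngram_range=(1, 2)).fit(corpus)
learned_terms = sorted(learned.vocabulary_)
print(learned_terms)

# Fixed vocabulary: ngram_range does not expand it. It only decides which
# n-grams are extracted from each document before the lookup.
fixed_terms = ["hi", "bye", "run away"]
with_bigrams = CountVectorizer(vocabulary=fixed_terms, ngram_range=(1, 2))
unigrams_only = CountVectorizer(vocabulary=fixed_terms, ngram_range=(1, 1))

counts_bigrams = with_bigrams.transform(corpus).toarray().tolist()
counts_unigrams = unigrams_only.transform(corpus).toarray().tolist()
print(counts_bigrams)   # the 'run away' column is counted
print(counts_unigrams)  # the 'run away' column stays at zero
```

With ngram_range=(1, 1) the bigram entry in the vocabulary can never match, because no bigrams are extracted from the documents in the first place.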
ngram_range you specified in the CountVectorizer. My solution then will be to manually vectorize my vocabulary to include 2-grams beforehand... unless you recommend any other methods? – Bullnecked
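One possible way to do that expansion, sketched under the assumption that the vocabulary should contain every unigram and bigram of each starting phrase: run the phrases through the same analyzer the vectorizer uses, via build_analyzer(), and pass the result as the vocabulary. The phrase list and the names `phrases`, `analyze`, and `expanded` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical starting phrases (no trailing space on 'hi').
phrases = ["hi", "bye", "run away"]

# build_analyzer() returns the callable the vectorizer uses to turn a
# string into its list of n-grams.
analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

# Collect each n-gram once, preserving first-seen order, because a
# vocabulary passed as a list must not contain duplicates.
expanded = []
for phrase in phrases:
    for term in analyze(phrase):
        if term not in expanded:
            expanded.append(term)
print(expanded)

# Use the expanded list as a fixed vocabulary.
cv = CountVectorizer(vocabulary=expanded, ngram_range=(1, 2))
row = cv.transform(["run away"]).toarray().tolist()
print(row)
```

Here 'run away' contributes 'run', 'away', and 'run away' to the vocabulary, so all three columns are counted when a document contains that phrase.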