As far as I know, in the Bag of Words method the features are a set of words together with their frequency counts in a document. On the other hand, n-grams, for example unigrams, do exactly the same, but do not take into account the frequency of occurrence of a word.
I want to use sklearn and CountVectorizer to implement both BOW and n-gram methods.
For BOW my code looks like this:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 1), max_features=3000)
Is it enough to set the 'binary' parameter to True to get the n-gram (presence/absence) representation?
CountVectorizer(ngram_range=(1, 1), max_features=3000, binary=True)
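A small sketch of what binary=True actually changes, using a hypothetical one-document corpus (the variable names are my own, not from your code): with the default settings each column holds a term frequency, while binary=True clips every nonzero count to 1, so only presence or absence is recorded.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat"]  # toy corpus; "the" occurs twice

# Default: classic bag of words with term frequencies.
counts = CountVectorizer(ngram_range=(1, 1)).fit_transform(docs)
print(counts.toarray())  # the column for "the" holds 2

# binary=True: presence/absence indicators only.
binary = CountVectorizer(ngram_range=(1, 1), binary=True).fit_transform(docs)
print(binary.toarray())  # every nonzero entry is clipped to 1
```

Note that binary=True is not feature selection: the vocabulary (the set of features) is identical in both cases; only the cell values change.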
What are the advantages of the n-gram method over BOW?
Comment: You can set ngram_range=(1, 2) (which includes unigrams and bigrams) or (2, 2), which would include only bigrams. – Chlodwig