Bag of Words (BOW) vs N-gram (sklearn CountVectorizer) - text documents classification
As far as I know, in the Bag of Words method, features are a set of words and their frequency counts in a document. On the other hand, N-grams, for example unigrams, do exactly the same, but do not take into consideration the frequency of occurrence of a word.

I want to use sklearn and CountVectorizer to implement both BOW and n-gram methods.

For BOW my code looks like this:

CountVectorizer(ngram_range=(1, 1), max_features=3000)

Is it enough to set the 'binary' parameter to True to perform n-gram feature selection?

CountVectorizer(ngram_range=(1, 1), max_features=3000, binary=True)

What are the advantages of n-gram over the BOW method?
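For reference, here is a minimal runnable comparison of the two configurations on a toy document (assuming scikit-learn is installed; the document and outputs are just an illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat"]

# Raw term counts (classic BOW): "the" occurs twice.
# The vocabulary is sorted alphabetically: cat, mat, on, sat, the
bow = CountVectorizer(ngram_range=(1, 1))
print(bow.fit_transform(docs).toarray())      # [[1 1 1 1 2]]

# binary=True only records presence/absence per document; it does NOT
# change which n-grams are extracted (that is what ngram_range controls).
binary = CountVectorizer(ngram_range=(1, 1), binary=True)
print(binary.fit_transform(docs).toarray())   # [[1 1 1 1 1]]
```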

Lesko answered 31/7, 2018 at 20:10 Comment(6)
"N-grams ... does not take into consideration the frequency of occurrence of a word" No, it takes into account the frequency of occurrence of the n-grams. And no, the 'binary' parameter has nothing to do with n-grams. If you want to use n-grams, you need to provide an n-gram order where like (1,2) (which includes "one"-grams and bigrams) or (2,2) which would include only bigrams.Chlodwig
Correct me if I am wrong, but if I do that, it will count the occurrences of words/phrases (unigrams or bigrams, depending on configuration). This is a BOW method. Shouldn't n-grams take care of word occurrence frequency?Lesko
What? Look, consider n-gram models as a type of BOW. You generally take the frequency of the n-grams. You don't have to, which is what the binary parameter is for. You can explore different approaches and their results, but typically counts are used, not just a binary flag for whether the term appears in the document.Chlodwig
Ahh, ok. I thought that n-grams and BOW were completely different methods... Now everything is understandable. Thanks!Lesko
And really, you don't generally use the raw counts, but some sort of weighting factor like tf–idf.Chlodwig
I am using tf-idf as well. My task is to compare different approaches.Lesko

As answered by @daniel-kurniadi, you need to adapt the value of the ngram_range parameter to use n-grams. For instance, with (1, 2), the vectorizer will take into account both unigrams and bigrams.

The main advantage of n-grams over BOW is that they take into account the sequence of words. For instance, consider the sentences:

  1. "I love vanilla but I hate chocolate"
  2. "I love chocolate but I hate vanilla"

The meaning is clearly different, but a basic BOW representation is the same in both cases. With n-grams (n >= 2), the representation captures the order of the terms, so the two sentences get different vectors.
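A minimal sketch with CountVectorizer (assuming scikit-learn is installed) illustrates this on the two sentences above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love vanilla but I hate chocolate",
    "I love chocolate but I hate vanilla",
]

# Unigram BOW: both sentences map to the exact same vector.
# (Note: the default tokenizer drops one-character tokens such as "I".)
bow = CountVectorizer(ngram_range=(1, 1))
X1 = bow.fit_transform(docs).toarray()
print((X1[0] == X1[1]).all())    # True

# Adding bigrams distinguishes "love vanilla" from "love chocolate",
# so the two rows are no longer identical.
bigram = CountVectorizer(ngram_range=(1, 2))
X2 = bigram.fit_transform(docs).toarray()
print((X2[0] == X2[1]).all())    # False
```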

Obediah answered 19/7, 2020 at 21:32 Comment(1)
Such a clever example! Impossible to forget.Forecast

If you set the ngram_range parameter to (m, n), it becomes an n-gram implementation.

Mailable answered 16/3, 2019 at 12:19 Comment(0)
