I am trying to do Sentiment Analysis on Tweets using Python.
To begin with, I've implemented an n-grams model. So, let's say our training data is
I am a good kid
He is a good kid, but he didn't get along with his sister much
Unigrams:
<i, am, a, good, kid, he, is, but, didnt, get, along, with, his, sister, much>
Bigrams:
<(i am), (am a), (a good), (good kid), (he is), (is a), (kid but), (but he), (he didnt), (didnt get), (get along), (along with), (with his), (his sister), (sister much)>
Trigrams:
<(i am a), (am a good), (a good kid), .........>
Final feature vector:
<i, am, a, good, kid, he, but, didnt, get, along, with, his, sister, much, (i am), (am a), (a good), (good kid), (he is), (is a), (kid but), (but he), (he didnt), (didnt get), (get along), (along with), (with his), (his sister), (sister much), (i am a), (am a good), (a good kid), .........>
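For reference, something like the following reproduces that combined feature vector with scikit-learn's CountVectorizer (a minimal sketch; the token_pattern here is just one way of keeping one-letter tokens such as "i" and "a", and is my own choice):

# Build unigram + bigram + trigram features in one vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

training_data = [
    "I am a good kid",
    "He is a good kid, but he didn't get along with his sister much",
]

# ngram_range=(1, 3) generates all n-grams for n = 1, 2, 3.
# The token_pattern keeps single-letter words, which the default pattern drops.
vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=r"\b\w+\b")
X = vectorizer.fit_transform(training_data)   # scipy.sparse CSR matrix

print(vectorizer.get_feature_names_out())     # the combined feature vector
print(X.shape)                                # (2, number_of_ngram_features)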
When we do this for a large training set of 8,000 or so entries, the dimensionality of the feature vector becomes so huge that my computer (16 GB of RAM) crashes.
So, when people mention using "n-grams" as features in the hundreds of papers out there, what are they talking about? Am I doing something wrong?
Do people always do some feature selection for "n-grams"? If so, what kind of feature selection should I look into?
I am using scikit-learn to do this.
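(For concreteness, the kind of feature selection I imagine people mean is sketched below; min_df, k and the chi-squared score are placeholder choices on my part, not taken from any paper.)

# Sketch of one possible way to prune the n-gram feature space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    # min_df=5 drops n-grams seen in fewer than 5 tweets, cutting off
    # most of the rare tail that blows up the dimensionality.
    ("vectorize", CountVectorizer(ngram_range=(1, 3), min_df=5)),
    # Keep only the 10,000 n-grams most associated with the sentiment labels.
    ("select", SelectKBest(chi2, k=10000)),
])

# tweets: list of tweet strings, labels: their sentiment labels
# features = pipeline.fit_transform(tweets, labels)   # stays sparse throughout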
You could use intern() to make sure you are only storing one copy of each token. – Clinchern
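A minimal sketch of what interning the tokens could look like (in Python 3 the function is sys.intern; applying it inside the tokenizer is my own assumption):

import sys

def tokenize(text):
    # sys.intern returns the canonical copy of each token string, so the
    # same word appearing in many tweets is stored in memory only once.
    return [sys.intern(tok) for tok in text.lower().split()]

tokens_a = tokenize("I am a good kid")
tokens_b = tokenize("He is a good kid")
print(tokens_a[3] is tokens_b[3])   # True - both lists share one "good" object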
Yes, the final feature vector will be large. However, it is possible to store such a large vector efficiently (knowing the co-occurrences of the words themselves). Separately, it rarely makes sense to use n > 6, as you'll have insufficient training data (because of the long, tapering tail). When these papers talk about n-grams, they're not talking about a scalable n - they're usually talking about a specific n (whose value might be revealed in the results or experiments section). – Magistracy
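As a rough illustration of storing such a vector efficiently (a sketch only; a per-tweet dictionary of counts is just one way to keep the non-zero entries):

from collections import Counter

def ngram_counts(tokens, n_max=3):
    # Map each 1..n_max-gram actually present in the tweet to its count;
    # everything else is implicitly zero, so nothing is stored for it.
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tweet = "i am a good kid".split()
print(ngram_counts(tweet))   # 12 non-zero entries, regardless of vocabulary size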