How do people use n-grams for sentiment analysis, considering that as n increases, the memory requirement also increases rapidly?

I am trying to do Sentiment Analysis on Tweets using Python.

To begin with, I've implemented an n-gram model. So, let's say our training data is

I am a good kid

He is a good kid, but he didn't get along with his sister much

Unigrams:

<i, am, a, good, kid, he, is, but, didnt, get, along, with, his, sister, much>

Bigrams:

<(i am), (am a), (a good), (good kid), (he is), (is a), (kid but), (but he), (he didnt), (didnt get), (get along), (along with), (with his), (his sister), (sister much)>

Trigrams:

<(i am a), (am a good), (a good kid), .........>

Final feature vector:

<i, am, a, good, kid, he, but, didnt, get, along, with, his, sister, much, (i am), (am a), (a good), (good kid), (he is), (is a), (kid but), (but he), (he didnt), (didnt get), (get along), (along with), (with his), (his sister), (sister much), (i am a), (am a good), (a good kid), .........>

When we do this for a larger training set of 8,000 or so entries, the dimensionality of the feature vector becomes HUGE, and as a result my computer (16 GB of RAM) crashes.

So, when people mention using "n-grams" as features, in 100s of papers out there, what are they talking about? Am I doing something wrong?

Do people always do some feature selection for "n-grams"? If so, what kind of feature selection should I look into?

I am using scikit-learn to do this.
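
For reference, here is a minimal sketch (not necessarily how the papers do it) using scikit-learn's CountVectorizer, which builds the combined unigram/bigram/trigram feature space and returns a scipy sparse matrix, so the zero entries are never stored; the tweet list is just the two example sentences above.

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "I am a good kid",
    "He is a good kid, but he didn't get along with his sister much",
]

# ngram_range=(1, 3) collects unigrams, bigrams and trigrams in one vocabulary
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(tweets)   # scipy.sparse matrix, not a dense array

print(X.shape)  # (number of tweets, number of distinct n-grams)
print(X.nnz)    # only the non-zero counts are stored
print(vectorizer.get_feature_names_out()[:10])  # get_feature_names() on older versions
```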

Alithea answered 9/11, 2014 at 3:45 Comment(10)
How many unique tokens do you have in the whole dataset? What is the size of your feature vector?Nival
Use intern() to make sure you are only storing one copy of each token.Clincher
Your thinking is exactly correct that for large values of n, the final feature vector will be large. However, it is possible to store such a large vector efficiently (knowing the co-occurrences of the words themselves). Separately, it rarely makes sense to use n>6 as you'll have insufficient training data (because of the long, tapering tail). When these papers talk about n-grams, they're not talking about a scalable n - they're USUALLY talking about a specific n (whose value might be revealed in the results or experiments section)Magistracy
@Magistracy, when you say "it is possible to store such a large vector efficiently (knowing the co-occurrences of the words themselves)", can you be more specific? How do I store this feature vector more efficiently?Alithea
You'll have to look up such efficient storage. I don't know what it is, as I've never had to implement it myself. But I /was/ able to think of some of the basics myself, which means they have to be well researched/developedMagistracy
I think @inspectorG4dget's first comment should have been an answer, <gritted-teeth>not a comment</gritted-teeth>. :-)Axes
@DarrenCook: please expand that into an answer, if you wouldMagistracy
@Magistracy Done, and then added the optimization suggestions I wanted to put as a comment on your answer!Axes
Typically, people don't go beyond n = 5 or 6. 16GB of RAM should be more than enough for 8,000 documents. This is because even though the vector space is high-dimensional, each vector is typically sparse. All you need to do is use some form of sparse representation (e.g. maintain only the non-zero indices per vector). This approach definitely works because I have used datasets much larger than 8,000 with 4GB RAM with no trouble at all.Axillary
do you have source code for it?Perjured

If you are storing your final feature vector exactly as you wrote it, I think I can suggest some improvements.

The memory issue arises because the features (the n-gram strings) are repeated so many times, and so are the tokens. Consider this process:

First of all, all the distinct features are stored (and given an index).

For example,

1--feature1--(i am)

2--feature2--(am a)

...

This generates a so-called feature space.

There might be thousands of features in total, or even more, but that is normal. Each entry can then be rewritten as a series of numbers, such as:

Entry1 ----- <1,1,1,0,....,a_n>, where the first 1 means feature1 (i am) occurs once in this entry, and a_n is the number of occurrences of feature n.

Let's assume there are many features and the entries are short, which means each vector contains far too many zeros. We can rewrite the previous vector as follows:

Entry1 ---- {1:1, 2:1, 3:1}, which means features 1, 2 and 3 of Entry1 each have the value 1, and all the other features are zero. Shorter, isn't it?

In the end each entry is represented as a short vector, and you get a big matrix for your corpus. Your corpus might look like this now:

{1:1, 2:1, 3:1}

{2:1, 29:1, 1029:1, 20345:1}

...
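
A rough sketch of that representation, assuming the tokenisation from the question (the function and variable names are mine, purely illustrative): each distinct n-gram gets an integer id, and each entry is stored as a {feature_id: count} dict, so zeros never appear.

```python
from collections import defaultdict

def to_sparse_rows(tokenized_entries):
    feature_index = {}                 # n-gram -> integer id, starting at 1
    rows = []
    for tokens in tokenized_entries:
        # unigrams and bigrams, as in the question (trigrams work the same way)
        ngrams = tokens + [" ".join(p) for p in zip(tokens, tokens[1:])]
        counts = defaultdict(int)
        for g in ngrams:
            fid = feature_index.setdefault(g, len(feature_index) + 1)
            counts[fid] += 1
        rows.append(dict(counts))      # only non-zero features are stored
    return feature_index, rows

index, rows = to_sparse_rows([
    "i am a good kid".split(),
    "he is a good kid but he didnt get along with his sister much".split(),
])
print(rows[0])   # e.g. {1: 1, 2: 1, 3: 1, ...}
```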

16 GB of RAM is more than sufficient for 8,000 entries. You could get by with much less.


Furthermore, if you end up with too many distinct tokens (which means far too many features), you can prune the feature space while constructing it: remove features whose frequency is below some threshold, say 3 occurrences. The feature space could shrink to half its size, or even less.
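
In scikit-learn this cutoff is essentially the min_df parameter (a document-frequency threshold rather than a raw occurrence count, but it serves the same purpose). A quick sketch with a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["i am a good kid", "he is a good kid", "a good kid indeed"]

# min_df=3 drops every n-gram that appears in fewer than 3 documents,
# so the vocabulary is built without the rare features in the first place.
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=3)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # only n-grams present in all 3 docs here
```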

Moulin answered 10/11, 2014 at 5:44 Comment(1)
To add to that: if your computer crashes, you might have tried to convert the sparse matrix format that eastdog mentioned into a dense numpy array, which represents all the zero entries explicitly. That will very likely cause a crash.Teacup

As inspectorG4dget said in the comments, you rarely go to high-order n-grams, e.g. n=5 or n=6, because you will not have enough training data to make it worthwhile. In other words, almost all of your 6-grams will have an occurrence count of 1. Also, to quote inspectorG4dget's comment:

When these papers talk about n-grams, they're not talking about a scalable n - they're USUALLY talking about a specific n (whose value might be revealed in the results or experiments section)

So memory is usually not the biggest concern. With a really large corpus you split it across a cluster, then combine results at the end. You might split based on how much memory each node in the cluster has, or, if processing a stream, you might stop and upload results to the central node each time memory fills up.

There are some optimizations you can do. If the corpus is being held in memory, then each n-gram only needs to be an index to the first occurrence in the corpus; the string does not need to be repeated.
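
A rough illustration of that idea, with names I have made up for the sketch: tokenize the corpus once, and identify each n-gram by (index of its first occurrence, n) rather than by a repeated string.

```python
corpus_tokens = "he is a good kid but he didnt get along with his sister much".split()

def count_ngrams_by_position(tokens, n):
    """Count n-grams, keyed by (first_occurrence_index, n) instead of by string."""
    first_seen = {}   # n-gram tuple -> index of its first occurrence
    counts = {}       # (first_occurrence_index, n) -> count
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        key = (first_seen.setdefault(gram, i), n)
        counts[key] = counts.get(key, 0) + 1
    return counts

counts = count_ngrams_by_position(corpus_tokens, 2)
(start, n), c = next(iter(counts.items()))
print(" ".join(corpus_tokens[start:start + n]), c)   # text recovered from the corpus on demand
```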

A second optimization, if you don't mind multiple passes, is to use the (n-1)-gram result to skip over sentence parts below your threshold. E.g. if you are only interested in n-grams that occur 3+ times, and if "He is a clever" only had a count of 2 at the 4-gram analysis, then when you discover the 5-gram of "He is a clever dog" you can throw it away, as you know it only occurs once or twice. This is a memory optimization at the expense of extra CPU.
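
A sketch of that multi-pass pruning (the function and thresholds are illustrative, not from the answer): because an n-gram can never occur more often than the (n-1)-gram it extends, any candidate whose prefix already fell below the threshold can be skipped without ever being counted.

```python
from collections import Counter

def count_ngrams(tokenized_docs, n, kept_shorter=None, min_count=3):
    counts = Counter()
    for tokens in tokenized_docs:
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            # skip the candidate if its (n-1)-gram prefix was already pruned
            if kept_shorter is not None and gram[:-1] not in kept_shorter:
                continue
            counts[gram] += 1
    return {g: c for g, c in counts.items() if c >= min_count}

docs = [t.split() for t in ["he is a clever dog", "he is a clever cat",
                            "he is a clever dog indeed"]]
kept = None
for n in range(1, 5):
    kept = count_ngrams(docs, n, kept_shorter=kept, min_count=3)
    print(n, len(kept))   # the surviving n-grams shrink as n grows
```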

Axes answered 10/11, 2014 at 8:40 Comment(0)
