Sentiment Analysis on LARGE collection of online conversation text

The title says it all; I have an SQL database bursting at the seams with online conversation text. I've already done most of this project in Python, so I would like to do this using Python's NLTK library (unless there's a strong reason not to).

The data is organized by Thread, Username, and Post. Each thread more or less focuses on one "product" in the Category I am interested in analyzing. Ultimately, when this is finished, I would like to have an estimated opinion (a like/dislike sort of deal) from each user for any of the products they have discussed at some point.

So, what I would like to know:

1) How can I go about determining what product each thread is about? I was reading about keyword extraction... is that the correct method?

2) How do I determine a specific user's sentiment based on their posts? From my limited understanding, I must first "train" NLTK to recognize certain indicators of opinion, and then simply determine the context of those words when they appear in the text?

As you may have guessed by now, I have no prior experience with NLP. From my reading so far, I think I can handle learning it though. Even just a basic and crude working model for now would be great if someone can point me in the right direction. Google was not very helpful to me.

P.S. I have permission to analyze this data (in case it matters)

Isidraisidro answered 10/3, 2013 at 19:44. Comments (4):
Do you have any labelled data? - Kilby
No, that's the thing. I've been trying to automate this as much as I can... labeling the data sounds like an extremely time-consuming, mind-numbing task. Is it absolutely required to gauge sentiment? If so, I would consider perhaps putting it up on Mechanical Turk or something like that... - Isidraisidro
All learning algorithms that I know of require you to have a training data set which you use to build a model. Then you can unleash it on unlabeled data. - Merissameristem
You can try semi-supervised learning: in this case you label a small subset of the data, and from there the algorithm takes all of the examples it feels confident about and trains on those as well. - Kilby

Training any classifier requires a training set of labeled data and a feature extractor to obtain feature sets for each text. After you have a trained classifier, you can apply it to previously unseen text (unlabeled) and obtain a classification based on the machine learning algorithm used. NLTK gives a good explanation and some samples to play around with.
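
A minimal sketch of that pipeline using NLTK's built-in Naive Bayes classifier; the two labeled posts and the bag-of-words extractor are placeholders, not a real training set:

```python
import nltk

# Toy labeled data; in practice these (text, label) pairs would come from
# your own annotation effort or a public dataset.
labeled_posts = [
    ("I love this phone, best purchase ever", "pos"),
    ("terrible battery, really regret buying it", "neg"),
]

def extract_features(text):
    # Simplest possible feature extractor: bag-of-words presence features.
    return {word: True for word in text.lower().split()}

train_set = [(extract_features(text), label) for text, label in labeled_posts]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Apply the trained classifier to previously unseen (unlabeled) text.
print(classifier.classify(extract_features("really love the screen")))
```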

If you are interested in building a classifier for positive/negative sentiment using your own training dataset, I would avoid simple keyword counts, as they aren't accurate for a number of reasons (e.g., negation of positive words: "not happy"). An alternative, where you can still use a large training set without having to manually label anything, is distant supervision. Basically, this approach uses emoticons or other specific text elements as noisy labels. You still have to choose which features are relevant, but many studies have had good results with simply using unigrams or bigrams (individual words or pairs of words, respectively).
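
A rough sketch of the distant-supervision labeling step, using emoticons as the noisy labels (the posts and regexes are illustrative; the resulting pairs would then feed the same training loop as above):

```python
import re

# Hypothetical posts pulled straight from the database.
posts = [
    "just got the new model, works great :)",
    "screen cracked within a week :(",
    "anyone know when the update ships?",
]

POS_EMOTICON = re.compile(r"[:;]-?\)")  # :) ;) :-) and friends
NEG_EMOTICON = re.compile(r":-?\(")     # :( :-(

def distant_label(text):
    """Noisy label from emoticons; None means the post stays unlabeled."""
    pos, neg = POS_EMOTICON.search(text), NEG_EMOTICON.search(text)
    if pos and not neg:
        return "pos"
    if neg and not pos:
        return "neg"
    return None

# Strip the emoticons from the text (so a classifier can't just memorize the
# label source) and keep only the posts that received a label.
train_pairs = []
for post in posts:
    label = distant_label(post)
    if label is not None:
        cleaned = NEG_EMOTICON.sub("", POS_EMOTICON.sub("", post)).strip()
        train_pairs.append((cleaned, label))

print(train_pairs)
# [('just got the new model, works great', 'pos'),
#  ('screen cracked within a week', 'neg')]
```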

All of this can be done relatively easily with Python and NLTK. You can also choose to use a tool like NLTK-trainer, which is a wrapper for NLTK and requires less code.

I think this study by Go et al. is one of the easiest to understand. You can also read other studies for distant supervision, distant supervision sentiment analysis, and sentiment analysis.

There are a few built-in classifiers in NLTK with both training and classification methods (Naive Bayes, MaxEnt, etc.), but if you are interested in using Support Vector Machines (SVMs), you should look elsewhere. Technically NLTK provides you with an SVM class, but it's really just a wrapper for PySVMLight, which itself is a wrapper for SVMLight, written in C. I had numerous problems with that approach, though, and would instead recommend LIBSVM.
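
One way to get LIBSVM from Python without touching the C bindings yourself is scikit-learn, whose SVC estimator is built on LIBSVM; NLTK ships a SklearnClassifier wrapper so it accepts the same (feature dict, label) pairs as the built-in classifiers. A sketch, reusing train_set and extract_features from the earlier example:

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC  # scikit-learn's SVC wraps LIBSVM internally

# train_set is the same list of (feature_dict, label) pairs built earlier.
svm_classifier = SklearnClassifier(SVC(kernel="linear")).train(train_set)
print(svm_classifier.classify(extract_features("really love the screen")))
```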

For determining the topic, many have used simple keywords but there are some more complex methods available.
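
For the simple-keyword route, a per-thread mention count is enough to get started; the product list here is obviously a placeholder for whatever category you are analyzing:

```python
from collections import Counter

# Placeholder product names for the category being analyzed.
PRODUCT_KEYWORDS = {"iphone", "galaxy", "pixel"}

def thread_topic(posts):
    """Guess the thread's product by counting product-name mentions."""
    counts = Counter(
        word
        for post in posts
        for word in post.lower().split()
        if word in PRODUCT_KEYWORDS
    )
    return counts.most_common(1)[0][0] if counts else None

print(thread_topic(["My iPhone is great", "the iphone camera beats my old galaxy"]))
# -> 'iphone'
```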

Dispensary answered 11/3, 2013 at 19:4. Comments (4):
One question and you get best answer, about your example for frequency counting, "not happy": couldn't I write an algorithm which parses each sentence independently, counts keywords, and then performs analysis by factoring in context? For example, say the sentence included "Not happy about my Product-Name-Here". Couldn't I write something that would notice "happy" is being negated by the "not", and is regarding "Product"? I know NLTK can break down sentences into verbs and nouns and such, can it not? So would it be possible to attack the problem from this angle? - Isidraisidro
@araibec Yes, but there are a lot of hidden complexities in trying to do that. You could use a negation-word and emotion-word dictionary, but if you consider the occurrence of a negation word in a sentence to mean that the emotion word is the opposite, what happens with "I'm happy with my iPhone but my friend is not"? It's telling that most current research studies choose to use methods like machine learning over keywords. It's really not much harder to set up, either. - Dispensary
Makes sense. Plus, machine learning could be implemented to self-optimize. Thanks for the answer! - Isidraisidro
Hey @jared, the link for the study by Go et al. is broken. Would you please update your answer to include that study? - Rightful

You could train any classifier with similar datasets and see what the results are when you apply it to your own data. For example, NLTK contains the Movie Reviews Corpus, which contains 1000 positive and 1000 negative reviews. Here is an example of how to train a Naive Bayes classifier with it. Some other review datasets, like the Amazon Product Review data, are available here.
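
A condensed version of that movie-reviews example; the 200-document holdout split is arbitrary:

```python
import random
import nltk
from nltk.corpus import movie_reviews

nltk.download("movie_reviews")  # fetch the corpus on first use

def features(words):
    # Bag-of-words presence features over the document's tokens.
    return {word: True for word in words}

documents = [
    (features(movie_reviews.words(fileid)), category)
    for category in movie_reviews.categories()
    for fileid in movie_reviews.fileids(category)
]
random.shuffle(documents)  # the corpus is ordered neg-then-pos

# Hold out 200 documents for a rough accuracy estimate.
train_set, test_set = documents[200:], documents[:200]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
```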

Another possibility is to take a list of positive and negative words like this one and count their frequencies in your dataset. If you want a complete list, use SentiWordNet.
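
The crude counting version looks roughly like this; the two word sets are stand-ins for the linked list (or for scores pulled from SentiWordNet), and note that it inherits the negation problem discussed in the other answer:

```python
# Stand-ins for a real opinion lexicon such as the linked word list.
POSITIVE_WORDS = {"good", "great", "love", "happy", "excellent"}
NEGATIVE_WORDS = {"bad", "awful", "hate", "broken", "terrible"}

def polarity_score(text):
    """Crude lexicon score: positive minus negative word count."""
    words = text.lower().split()
    pos = sum(word in POSITIVE_WORDS for word in words)
    neg = sum(word in NEGATIVE_WORDS for word in words)
    return pos - neg  # > 0 leans positive, < 0 leans negative

print(polarity_score("love the screen but the battery is awful"))  # -> 0
```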

Esquiline answered 11/3, 2013 at 13:36.
