How to train the Stanford NLP Sentiment Analysis tool
Asked Answered
C

3

15

Hell everyone! I'm using the Stanford Core NLP package and my goal is to perform sentiment analysis on a live-stream of tweets.

Using the sentiment analysis tool as is returns a very poor analysis of text's 'attitude' .. many positives are labeled neutral, many negatives rated positive. I've gone ahead an acquired well over a million tweets in a text file, but I haven't a clue how to actually train the tool and create my own model.

Link to Stanford Sentiment Analysis page

"Models can be retrained using the following command using the PTB format dataset:"

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath     dev.txt -train -model model.ser.gz

Sample from dev.txt (The leading 4 represents polarity out of 5 ... 4/5 positive)

(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) (3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))

Sample from test.txt

(3 (3 (2 If) (3 (2 you) (3 (2 sometimes) (2 (2 like) (3 (2 to) (3 (3 (2 go) (2 (2 to) (2 (2 the) (2 movies)))) (3 (2 to) (3 (2 have) (4 fun))))))))) (2 (2 ,) (2 (2 Wasabi) (3 (3 (2 is) (2 (2 a) (2 (3 good) (2 (2 place) (2 (2 to) (2 start)))))) (2 .)))))

Sample from train.txt

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

I have two questions going forward.

What is the significance and difference between each file? Train.txt/Dev.txt/Test.txt ?

How would I train my own model with a raw, unparsed text file full of tweets?

I'm very new to NLP so if I am missing any required information or anything at all please critique! Thank you!

Crosspiece answered 23/3, 2014 at 3:21 Comment(1)
can I see the format of train.txt? thanksCalif
F
10

What is the significance and difference between each file? Train.txt/Dev.txt/Test.txt ?

This is standard machine learning terminology. The train set is used to (surprise surprise) train a model. The development set is used to tune any parameters the model might have. What you would normally do is pick a parameter value, train a model on the training set, and then check how well the trained model does on the development set. You then pick another parameter value and repeat. This procedure helps you find reasonable parameter values for your model.

Once this is done, you proceed to test how well the model does on the test set. This is unseen- your model has never encountered any of that data before. It is important that the test set is separate from the training and development set, otherwise you are effectively evaluating a model on data it has seen before. This would be wrong as it will not give you an idea of how well the model really does.

How would I train my own model with a raw, unparsed text file full of tweets?

You can't and you shouldn't train using an unparsed set of documents. The entire point of the recursive deep model (and the reason it performs so well) is that it can learn from the sentiment annotations at every level of the parse tree. The sentence you have given above can be formatted like this:

(4 
    (4 
        (2 A) 
        (4 
            (3 (3 warm) (2 ,)) (3 funny)
        )
    ) 
    (3 
        (2 ,) 
        (3 
            (4 (4 engaging) (2 film)) (2 .)
        )
    )
)

Usually, a sentiment analyser is trained with document-level annotations. You only have one score, and this score applies to the document as a whole, ignoring the fact that the phrases in the document may express different sentiment. The Stanford team put a lot of effort into annotating every phrase in the document for sentiment. For example, the word film on its own is neutral in sentiment: (2 film). However, the phrase engaging film is very positive: (4 (4 engaging) (2 film)) (2 .)

If you have labelled tweets, you can use any other document-level sentiment classifier. The tag on stackoverflow already has some very good answers, I'm not going to repeat them here.

PS Did you label the tweets you have? All 1 million of them? If you did, I'd like to pay you a lot of money for that file :)

Fortyniner answered 25/3, 2014 at 12:54 Comment(5)
Haha. I can def use that file at the moment :)Alenealenson
Is there any Java code available for creating the formatted(parsed) sentence as you have shown? For e.g. I have tweets and would like to trainPseudohermaphrodite
Did you find any code to format(parse) the sentence as shown? @SameerThigaleTumbleweed
train/dev/test datasets are also referred as train/test/validateJus
so were you able to train model for your tweets? I have similar goalCalif
R
1

The Java code:

BuildBinarizedDataset -> [http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/sentiment/BuildBinarizedDataset.html

SentimentTraining -> http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/sentiment/SentimentTraining.html

For those who code in C#, I converted the Java source into two code files which should make understanding this process a lot simpler.

https://arachnode.net/blogs/arachnode_net/archive/2015/09/03/buildbinarizeddataset-and-sentimenttraining-stanford-nlp.aspx

Ribbentrop answered 2/9, 2015 at 21:20 Comment(3)
This may theoretically answer the question, but it would be best to include the essential parts of the answer here for future users, and provide the link for reference. Link-dominated answers can become invalid through link rot.Repand
OK, I will see if I can expand the answer in the next day.Ribbentrop
@Ribbentrop That next day never came. And your links are to the javadoc explaining the class which actually does the sentiment training not to how to go about using it from the perspective of a person who does not code in java ?Measure
M
1

If it helps, I got the C# code from Arachnode working very easily - a tweak or two to get the right paths for models and so on, but it then works great. What was missing was something about the right format for the input files. It's in the Javadoc, but for reference, for BuildBinarizedDataset it's something like:

2 line of text here

0 another line of text 

1 yet another line of text

etc

Building that is pretty trivial, depending on what you're starting with (a database, Excel file, whatever)

Mcclung answered 7/10, 2016 at 8:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.