How to train the Stanford NLP Sentiment Analysis tool

Asked 23/3, 2014 at 3:21 Answered 7/10, 2016 at 8:49

java nlp stanford-nlp sentiment-analysis

Hello everyone! I'm using the Stanford Core NLP package and my goal is to perform sentiment analysis on a live stream of tweets.

Using the sentiment analysis tool as is returns a very poor analysis of the text's 'attitude': many positives are labeled neutral, and many negatives are rated positive. I've gone ahead and acquired well over a million tweets in a text file, but I haven't a clue how to actually train the tool and create my own model.

Link to Stanford Sentiment Analysis page

"Models can be retrained using the following command using the PTB format dataset:"

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

Sample from dev.txt (the leading 4 is the sentiment label for the whole sentence; labels run from 0, very negative, to 4, very positive)

(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) (3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))

Sample from test.txt

(3 (3 (2 If) (3 (2 you) (3 (2 sometimes) (2 (2 like) (3 (2 to) (3 (3 (2 go) (2 (2 to) (2 (2 the) (2 movies)))) (3 (2 to) (3 (2 have) (4 fun))))))))) (2 (2 ,) (2 (2 Wasabi) (3 (3 (2 is) (2 (2 a) (2 (3 good) (2 (2 place) (2 (2 to) (2 start)))))) (2 .)))))

Sample from train.txt

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

I have two questions going forward.

What is the significance and difference between each file? Train.txt/Dev.txt/Test.txt ?

How would I train my own model with a raw, unparsed text file full of tweets?

I'm very new to NLP so if I am missing any required information or anything at all please critique! Thank you!

Crosspiece answered 23/3, 2014 at 3:21 Comment(1)

can I see the format of train.txt? thanks – Calif 18/7, 2018 at 8:48

What is the significance and difference between each file? Train.txt/Dev.txt/Test.txt ?

This is standard machine learning terminology. The train set is used to (surprise surprise) train a model. The development set is used to tune any parameters the model might have. What you would normally do is pick a parameter value, train a model on the training set, and then check how well the trained model does on the development set. You then pick another parameter value and repeat. This procedure helps you find reasonable parameter values for your model.

Once this is done, you proceed to test how well the model does on the test set. This data is unseen: your model has never encountered any of it before. It is important that the test set is kept separate from the training and development sets, otherwise you are effectively evaluating the model on data it has already seen. That would give you an overly optimistic score rather than an idea of how well the model really does on new text.
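As a rough illustration of keeping the three sets disjoint, here is a minimal sketch in plain Java. The class name `DatasetSplit`, the fractions, and the seed are all made up for this example; nothing here comes from Stanford CoreNLP. It shuffles the examples once with a fixed seed and then cuts the list into three non-overlapping parts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper (not part of CoreNLP): holds three disjoint partitions.
class DatasetSplit {
    final List<String> train;
    final List<String> dev;
    final List<String> test;

    private DatasetSplit(List<String> train, List<String> dev, List<String> test) {
        this.train = train;
        this.dev = dev;
        this.test = test;
    }

    // devFraction and testFraction are fractions of the whole data set, e.g. 0.1 each.
    static DatasetSplit split(List<String> examples, double devFraction,
                              double testFraction, long seed) {
        if (devFraction < 0 || testFraction < 0 || devFraction + testFraction > 1.0) {
            throw new IllegalArgumentException("fractions must be >= 0 and sum to at most 1");
        }
        List<String> shuffled = new ArrayList<>(examples);
        // A fixed seed makes the split reproducible between runs.
        Collections.shuffle(shuffled, new Random(seed));

        int n = shuffled.size();
        int devSize = (int) (n * devFraction);   // rounded down
        int testSize = (int) (n * testFraction); // rounded down
        int trainSize = n - devSize - testSize;  // everything else is training data

        return new DatasetSplit(
                new ArrayList<>(shuffled.subList(0, trainSize)),
                new ArrayList<>(shuffled.subList(trainSize, trainSize + devSize)),
                new ArrayList<>(shuffled.subList(trainSize + devSize, n)));
    }

    public static void main(String[] args) {
        List<String> examples = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            examples.add("example " + i);
        }
        DatasetSplit s = split(examples, 0.2, 0.2, 42L);
        System.out.println("train=" + s.train.size()
                + " dev=" + s.dev.size() + " test=" + s.test.size());
    }
}
```

Because each example ends up in exactly one list, tuning on `dev` never touches `test`, which is the property described above.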

How would I train my own model with a raw, unparsed text file full of tweets?

You can't and you shouldn't train using an unparsed set of documents. The entire point of the recursive deep model (and the reason it performs so well) is that it can learn from the sentiment annotations at every level of the parse tree. The sentence you have given above can be formatted like this:

(4 
    (4 
        (2 A) 
        (4 
            (3 (3 warm) (2 ,)) (3 funny)
        )
    ) 
    (3 
        (2 ,) 
        (3 
            (4 (4 engaging) (2 film)) (2 .)
        )
    )
)

Usually, a sentiment analyser is trained with document-level annotations. You only have one score, and this score applies to the document as a whole, ignoring the fact that the phrases in the document may express different sentiment. The Stanford team put a lot of effort into annotating every phrase in the document for sentiment. For example, the word film on its own is neutral in sentiment: (2 film). However, the phrase engaging film is very positive: (4 (4 engaging) (2 film))

If you have labelled tweets, you can use any other document-level sentiment classifier. The sentiment-analysis tag on Stack Overflow already has some very good answers, so I'm not going to repeat them here.
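To make the "a label at every level" point concrete, here is a small sketch that reads one tree in the bracketed format shown above and lists every phrase with its label. It is plain Java with a hand-written parser; `SentimentTreeReader` is a name invented for this example and is not a CoreNLP class. It assumes well-formed input and does not try to recover from unbalanced brackets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of CoreNLP): lists every labelled phrase in a tree.
class SentimentTreeReader {
    private final String text;
    private int pos = 0;
    private final List<String> phrases = new ArrayList<>();

    private SentimentTreeReader(String text) {
        this.text = text;
    }

    // Returns "label<TAB>phrase" for every node, children before their parent.
    static List<String> labelledPhrases(String tree) {
        SentimentTreeReader reader = new SentimentTreeReader(tree);
        reader.parseNode();
        return reader.phrases;
    }

    // Parses one "(label word)" or "(label child child)" node and returns its words.
    private String parseNode() {
        skipSpaces();
        if (pos >= text.length() || text.charAt(pos) != '(') {
            throw new IllegalArgumentException("expected '(' at position " + pos);
        }
        pos++; // consume '('
        String label = readToken();
        StringBuilder words = new StringBuilder();
        skipSpaces();
        while (text.charAt(pos) != ')') {
            // A '(' starts a sub-phrase; anything else is a word.
            String part = text.charAt(pos) == '(' ? parseNode() : readToken();
            if (words.length() > 0) {
                words.append(' ');
            }
            words.append(part);
            skipSpaces();
        }
        pos++; // consume ')'
        phrases.add(label + "\t" + words);
        return words.toString();
    }

    private String readToken() {
        int start = pos;
        while (pos < text.length()
                && !Character.isWhitespace(text.charAt(pos))
                && text.charAt(pos) != '(' && text.charAt(pos) != ')') {
            pos++;
        }
        return text.substring(start, pos);
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }

    public static void main(String[] args) {
        String tree = "(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) "
                + "(3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))";
        for (String phrase : labelledPhrases(tree)) {
            System.out.println(phrase);
        }
    }
}
```

Running it on the sample sentence lists film with label 2 and engaging film with label 4, which is exactly the phrase-level information a document-level label would throw away.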

PS Did you label the tweets you have? All 1 million of them? If you did, I'd like to pay you a lot of money for that file :)

Fortyniner answered 25/3, 2014 at 12:54 Comment(5)

Haha. I can def use that file at the moment :) – Alenealenson 27/8, 2014 at 19:6

Is there any Java code available for creating the formatted(parsed) sentence as you have shown? For e.g. I have tweets and would like to train – Pseudohermaphrodite 7/2, 2015 at 6:56

Did you find any code to format(parse) the sentence as shown? @SameerThigale – Tumbleweed 18/9, 2017 at 18:28

The train/dev/test datasets are also referred to as train/validation/test. – Jus 10/5, 2018 at 6:47

so were you able to train model for your tweets? I have similar goal – Calif 18/7, 2018 at 9:2

The Java code:

BuildBinarizedDataset -> http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/sentiment/BuildBinarizedDataset.html

SentimentTraining -> http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/sentiment/SentimentTraining.html

For those who code in C#, I converted the Java source into two code files which should make understanding this process a lot simpler.

https://arachnode.net/blogs/arachnode_net/archive/2015/09/03/buildbinarizeddataset-and-sentimenttraining-stanford-nlp.aspx

Ribbentrop answered 2/9, 2015 at 21:20 Comment(3)

This may theoretically answer the question, but it would be best to include the essential parts of the answer here for future users, and provide the link for reference. Link-dominated answers can become invalid through link rot. – Repand 2/9, 2015 at 21:33

OK, I will see if I can expand the answer in the next day. – Ribbentrop 3/9, 2015 at 1:5

@Ribbentrop That next day never came. Also, your links point to the Javadoc for the classes that do the sentiment training, not to an explanation of how to use them from the perspective of someone who does not code in Java. – Measure 1/9, 2016 at 13:53

If it helps, I got the C# code from Arachnode working very easily - a tweak or two to get the right paths for models and so on, but it then works great. What was missing was something about the right format for the input files. It's in the Javadoc, but for reference, for BuildBinarizedDataset it's something like:

2 line of text here

0 another line of text 

1 yet another line of text

etc

Building that is pretty trivial, depending on what you're starting with (a database, Excel file, whatever)
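For anyone starting from tweets already held in memory, a minimal Java sketch of producing that layout could look like the following. `BinarizerInputWriter` and the sample data are invented for this example, and the blank line after each entry reflects my reading of the BuildBinarizedDataset Javadoc (sentences separated by blank lines), so check it against the Javadoc before relying on it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of CoreNLP): formats labelled tweets as
// "label text" entries, one per line, each followed by a blank line.
class BinarizerInputWriter {

    // Keys are tweet texts, values are their sentiment labels.
    static String format(Map<String, Integer> labelledTweets) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> entry : labelledTweets.entrySet()) {
            // Collapse line breaks and repeated spaces so each tweet stays on one line.
            String text = entry.getKey().replaceAll("\\s+", " ").trim();
            out.append(entry.getValue()).append(' ').append(text).append("\n\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the tweets in insertion order.
        Map<String, Integer> tweets = new LinkedHashMap<>();
        tweets.put("line of text here", 2);
        tweets.put("another line of text", 0);
        tweets.put("yet another line of text", 1);
        // Redirect standard output to a file to get the input for BuildBinarizedDataset.
        System.out.print(format(tweets));
    }
}
```

The whitespace collapsing matters for tweets in particular, since an embedded newline would otherwise split one tweet across two lines of the input file.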

Mcclung answered 7/10, 2016 at 8:49 Comment(0)
