Stanford NLP: Sentiment analysis for Chinese language

Asked 26/10, 2014 at 9:18 Answered 27/5, 2015 at 4:36

java dataset stanford-nlp sentiment-analysis

I want to create a sentiment analysis program that takes in a dataset in Chinese and determines whether there are more positive, negative, or neutral statements. Following the example, I created a sentiment analysis program for English (stanford-corenlp) that does exactly what I want, except that I need it to take Chinese input.

Questions:

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
    // available annotators include: gender, lemma, ner, parse, pos, sentiment, ssplit, tokenize
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // read some text into the sentimentText variable
    String sentimentText = "Fun day, isn't it?";
    String[] ratings = {"Very Negative", "Negative", "Neutral", "Positive", "Very Positive"};
    Annotation annotation = pipeline.process(sentimentText);
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
        // sentiment-annotated parse tree for this sentence
        Tree tree = sentence.get(SentimentCoreAnnotations.AnnotatedTree.class);
        // predicted class index, used below to index into ratings (0 to 4)
        int score = RNNCoreAnnotations.getPredictedClass(tree);
        System.out.println("sentence:'" + sentence + "' has a score of " + (score - 2) + " rating: " + ratings[score]);
        System.out.println(tree);
    }
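
For the counting part of the goal (more positive, negative, or neutral statements), a minimal sketch, assuming the per-sentence predicted classes from the loop above have been collected into an array. The class name SentimentTally, the sample scores, and the choice to fold classes 0-1 into "negative" and 3-4 into "positive" are illustrative assumptions, not part of CoreNLP:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SentimentTally {
    // Collapse a predicted class (0..4) into one of three buckets.
    static String bucket(int predictedClass) {
        if (predictedClass < 0 || predictedClass > 4) {
            throw new IllegalArgumentException("class out of range: " + predictedClass);
        }
        if (predictedClass < 2) return "negative";
        if (predictedClass > 2) return "positive";
        return "neutral";
    }

    // Count how many sentences fall into each bucket.
    static Map<String, Integer> tally(int[] predictedClasses) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("negative", 0);
        counts.put("neutral", 0);
        counts.put("positive", 0);
        for (int c : predictedClasses) {
            counts.merge(bucket(c), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical per-sentence scores; in practice these would be the
        // values returned by RNNCoreAnnotations.getPredictedClass(tree).
        int[] scores = {4, 3, 2, 0, 3};
        System.out.println(tally(scores));
    }
}
```

The tally is independent of the language being analysed, so it should carry over unchanged once a Chinese pipeline produces the scores.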

Currently, I have no idea how to change the above code to support Chinese. I downloaded the Chinese parser and segmenter and looked at the demo, but after days of trying I have gotten nowhere. I have also read http://nlp.stanford.edu/software/corenlp.shtml, which is really useful for the English version. Are there any ebooks, tutorials, or examples that could help me understand how Chinese sentiment analysis with Stanford NLP works?

Thanks in advance!

PS: I picked up Java not long ago, so pardon me if there is anything I did not say or do correctly.

What I have researched:

How to parse languages other than English with Stanford Parser？ in java, not command lines

Using stanford parser to parse Chinese

Nudi answered 26/10, 2014 at 9:18 Comment(0)

Based on my experience with German, here is what you need to do:

1. Get a corpus of Chinese text.
2. Parse each sentence.
3. Binarize the resulting parse trees.
4. For each node in the binarized parse tree, extract the phrase spanned by that node.
5. Annotate each phrase with a sentiment label:
   - 0: very negative
   - 1: slightly negative
   - 2: neutral
   - 3: slightly positive
   - 4: very positive
6. Apply the labels to the parse trees using something like BuildBinarizedDataset. Note that BuildBinarizedDataset is set up for English and will parse your sentences again. I found it more practical to apply the labels to my pre-existing parses from step 3.
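
To make the phrase-extraction and labelling steps concrete, here is a minimal sketch that does not use CoreNLP at all. The Node class, the English placeholder tokens, and the labels are all hypothetical; in practice the leaves would be the segmenter's Chinese tokens and the tree would come from your own binarized parses. The bracketed output, with the sentiment label in place of the syntactic category, is my understanding of what the sentiment training tools read, so treat that as an assumption and compare it with the training data you actually use:

```java
import java.util.ArrayList;
import java.util.List;

public class LabeledTreeSketch {
    /** A node of a binarized tree: either a leaf holding one word, or exactly two children. */
    static final class Node {
        final String word;       // non-null only for leaves
        final Node left, right;  // non-null only for internal nodes
        int label = 2;           // sentiment label 0..4, neutral by default

        Node(String word) { this.word = word; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.word = null; this.left = left; this.right = right; }

        boolean isLeaf() { return word != null; }

        /** The phrase spanned by this node, tokens joined by spaces. */
        String phrase() {
            return isLeaf() ? word : left.phrase() + " " + right.phrase();
        }

        /** Collects this node and all descendants in pre-order. */
        void collect(List<Node> out) {
            out.add(this);
            if (!isLeaf()) {
                left.collect(out);
                right.collect(out);
            }
        }

        /** Bracketed tree with each node's sentiment label as its category. */
        String toLabeledString() {
            return isLeaf()
                    ? "(" + label + " " + word + ")"
                    : "(" + label + " " + left.toLabeledString() + " " + right.toLabeledString() + ")";
        }
    }

    /** Builds a tiny hand-made tree with hypothetical labels. */
    static Node example() {
        Node very = new Node("very");
        Node good = new Node("good");
        Node film = new Node("film");
        Node veryGood = new Node(very, good);
        Node root = new Node(veryGood, film);
        good.label = 3;      // slightly positive
        veryGood.label = 4;  // very positive
        root.label = 4;      // very positive
        return root;
    }

    public static void main(String[] args) {
        Node root = example();
        List<Node> nodes = new ArrayList<>();
        root.collect(nodes);
        // One line per phrase: this is the list an annotator would label.
        for (Node n : nodes) {
            System.out.println(n.label + "\t" + n.phrase());
        }
        System.out.println(root.toLabeledString());
    }
}
```

The last line printed is `(4 (4 (2 very) (3 good)) (2 film))`. Keeping the labels on your own tree objects like this avoids re-parsing the sentences.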

For the annotation: Either do this yourself or use a crowdsourcing service like CrowdFlower. I found the 'sentiment analysis' template on CrowdFlower to be useful.

Electroencephalogram answered 29/11, 2014 at 10:9 Comment(1)

Amazon Mechanical Turk can also be used for the annotation – Ppm 10/2, 2015 at 13:41

I'm working on the same problem and also having issues. Here is how far I have gotten:

You need to change the properties to support Chinese as follows:

    props.setProperty("customAnnotatorClass.segment","edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator");

    props.setProperty("pos.model","edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger");
    props.setProperty("parse.model","edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz");

    props.setProperty("segment.model","edu/stanford/nlp/models/segmenter/chinese/ctb.gz");
    props.setProperty("segment.serDictionary","edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz");
    props.setProperty("segment.sighanCorporaDict","edu/stanford/nlp/models/segmenter/chinese");
    props.setProperty("segment.sighanPostProcessing","true");

    props.setProperty("ssplit.boundaryTokenRegex","[.]|[!?]+|[。]|[！？]+");

    props.setProperty("ner.model","edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz");
    props.setProperty("ner.applyNumericClassifiers","false");
    props.setProperty("ner.useSUTime","false");

But the problem that still persists is that the tokenizer being used still defaults to PTBTokenizer (for English).
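
One thing that may be worth trying, offered as an assumption and not a confirmed fix: if the "annotators" property still lists "tokenize", the pipeline will build the default tokenizer no matter what the segmenter properties say. The sketch below lists the custom "segment" annotator in its place. It only builds and prints the Properties object, so it runs without CoreNLP; the class name ChinesePipelineProps is made up, and passing the result to `new StanfordCoreNLP(props)` would additionally need the Chinese models on the classpath. The "sentiment" annotator is left out because nothing in this thread names a Chinese sentiment model.

```java
import java.util.Properties;

public class ChinesePipelineProps {
    static Properties chineseProps() {
        Properties props = new Properties();
        // Assumption: naming the custom "segment" annotator here, in place of
        // "tokenize", is what keeps the English PTBTokenizer out of the pipeline.
        props.setProperty("annotators", "segment, ssplit, pos, parse");
        props.setProperty("customAnnotatorClass.segment",
                "edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator");

        // Segmenter settings, copied from the properties listed above.
        props.setProperty("segment.model", "edu/stanford/nlp/models/segmenter/chinese/ctb.gz");
        props.setProperty("segment.serDictionary",
                "edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz");
        props.setProperty("segment.sighanCorporaDict", "edu/stanford/nlp/models/segmenter/chinese");
        props.setProperty("segment.sighanPostProcessing", "true");

        // Same sentence-boundary regex as above, with the full-width
        // punctuation written as Unicode escapes.
        props.setProperty("ssplit.boundaryTokenRegex", "[.]|[!?]+|[\u3002]|[\uFF01\uFF1F]+");

        // Tagger and parser models, copied from the properties listed above.
        props.setProperty("pos.model",
                "edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger");
        props.setProperty("parse.model", "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz");
        return props;
    }

    public static void main(String[] args) {
        Properties props = chineseProps();
        System.out.println("annotators = " + props.getProperty("annotators"));
    }
}
```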

For Spanish the corresponding properties are:

    props.setProperty("tokenize.language","es");
    props.setProperty("sentiment.model","src/international/spanish");

    props.setProperty("pos.model","src/models/pos-tagger/spanish/spanish-distsim.tagger");

    props.setProperty("ner.model","src/models/ner/spanish.ancora.distsim.s512.crf.ser.gz");
    props.setProperty("ner.applyNumericClassifiers","false");
    props.setProperty("ner.useSUTime","false");

    props.setProperty("parse.model","src/models/lexparser/spanishPCFG.ser.gz");

This works just fine for Spanish. Notice that the 'tokenize.language' property is set to 'es'. There is no such setting for Chinese: I have tried 'ch', 'cn', 'zh', and 'zh-cn', but nothing works. Let me know if you make further progress.

Aguedaaguero answered 27/5, 2015 at 4:36 Comment(0)
