Result difference between Stanford NER tagger in NLTK (Python) vs. Java
I am using both Python and Java to run the Stanford NER tagger, but I am seeing a difference in the results.

For example, when I input the sentence "Involved in all aspects of data modeling using ERwin as the primary software for this.",

JAVA Result:

"ERwin": "PERSON"

Python Result:

In [6]: NERTagger.tag("Involved in all aspects of data modeling using ERwin as the primary software for this.".split())
Out [6]:[(u'Involved', u'O'),
 (u'in', u'O'),
 (u'all', u'O'),
 (u'aspects', u'O'),
 (u'of', u'O'),
 (u'data', u'O'),
 (u'modeling', u'O'),
 (u'using', u'O'),
 (u'ERwin', u'O'),
 (u'as', u'O'),
 (u'the', u'O'),
 (u'primary', u'O'),
 (u'software', u'O'),
 (u'for', u'O'),
 (u'this.', u'O')]

The Python NLTK wrapper fails to tag "ERwin" as PERSON.

What's interesting here is that both Python and Java use the same trained model (english.all.3class.caseless.distsim.crf.ser.gz), released on 2015-04-20.

My ultimate goal is to make the Python wrapper behave the same way the Java tagger does.

I'm looking at StanfordNERTagger in nltk.tag to see if there's anything I can modify. Below is the wrapper code:

class StanfordNERTagger(StanfordTagger):
    """
    A class for Named-Entity Tagging with Stanford Tagger. The input is the paths to:

    - a model trained on training data
    - (optionally) the path to the stanford tagger jar file. If not specified here,
      then this jar file must be specified in the CLASSPATH environment variable.
    - (optionally) the encoding of the training data (default: UTF-8)

    Example:

        >>> from nltk.tag import StanfordNERTagger
        >>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz') # doctest: +SKIP
        >>> st.tag('Rami Eid is studying at Stony Brook University in NY'.split()) # doctest: +SKIP
        [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
         ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
         ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
    """

    _SEPARATOR = '/'
    _JAR = 'stanford-ner.jar'
    _FORMAT = 'slashTags'

    def __init__(self, *args, **kwargs):
        super(StanfordNERTagger, self).__init__(*args, **kwargs)

    @property
    def _cmd(self):
        # -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer and
        # -tokenizerOptions tokenizeNLs=false keep the input tokenization
        # instead of running the Stanford tokenizer.
        return ['edu.stanford.nlp.ie.crf.CRFClassifier',
                '-loadClassifier', self._stanford_model,
                '-textFile', self._input_file_path,
                '-outputFormat', self._FORMAT,
                '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer',
                '-tokenizerOptions', '"tokenizeNLs=false"']

    def parse_output(self, text, sentences):
        if self._FORMAT == 'slashTags':
            # Join all (word, tag) pairs into one flat list
            tagged_sentences = []
            for tagged_sentence in text.strip().split("\n"):
                for tagged_word in tagged_sentence.strip().split():
                    word_tags = tagged_word.strip().split(self._SEPARATOR)
                    tagged_sentences.append((''.join(word_tags[:-1]), word_tags[-1]))

            # Re-split the flat list according to the input sentence lengths
            result = []
            start = 0
            for sent in sentences:
                result.append(tagged_sentences[start:start + len(sent)])
                start += len(sent)
            return result

        raise NotImplementedError
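For reference, the slashTags parsing above can be exercised in isolation. A minimal, self-contained sketch (`parse_slash_tags` is a hypothetical helper, not part of NLTK); note it uses `rpartition`, which, unlike the wrapper's `''.join(word_tags[:-1])`, preserves any separator characters inside a token:

```python
# Hypothetical standalone version of the slashTags parsing in parse_output.
def parse_slash_tags(text, separator='/'):
    pairs = []
    for tagged_sentence in text.strip().split('\n'):
        for tagged_word in tagged_sentence.strip().split():
            # rpartition splits on the LAST separator, so slashes inside a
            # token survive (the wrapper's ''.join(word_tags[:-1]) drops them).
            word, _, tag = tagged_word.rpartition(separator)
            pairs.append((word, tag))
    return pairs

print(parse_slash_tags('Rami/PERSON Eid/PERSON is/O studying/O'))
# → [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O')]
```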

Or, if the difference is because a different classifier is used (the Java code seems to use AbstractSequenceClassifier, while the Python NLTK wrapper uses CRFClassifier), is there a way I can use AbstractSequenceClassifier in the Python wrapper?

Merrill answered 6/1, 2016 at 5:56

Comments (5):

- Using CoreNLP is the way to go for flexible use of Stanford tools with a Python interface. But let me try whether I can hack our way out of this, after breakfast though ;) – Duumvir
- What is the Java command you ran? Did you run it on the command line? – Duumvir
- Did Gabor Angeli's solution actually work or not? – Sponsor
- @Toussaint They updated maxAdditionalKnownLCWords to 0, but I still get different results. – Azoic
- Have you succeeded in fixing the issue? – Westonwestover

Try setting maxAdditionalKnownLCWords to 0 in the properties file (or command line) for CoreNLP, and if possible for NLTK as well. This disables an option which allows the NER system to learn from test-time data a little bit, which could cause occasional mildly different results.
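One way to apply this from the NLTK side is to extend the command the wrapper builds. A minimal sketch, assuming CRFClassifier accepts `-maxAdditionalKnownLCWords` as a command-line flag like its other properties (`build_ner_cmd` is a hypothetical helper mirroring the wrapper's `_cmd` property, not part of NLTK):

```python
# Hypothetical helper mirroring the NLTK wrapper's _cmd property, with
# -maxAdditionalKnownLCWords 0 appended (assumption: CRFClassifier accepts
# this property as a command-line flag, like its other properties).
def build_ner_cmd(model, input_file):
    return ['edu.stanford.nlp.ie.crf.CRFClassifier',
            '-loadClassifier', model,
            '-textFile', input_file,
            '-outputFormat', 'slashTags',
            '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer',
            '-tokenizerOptions', '"tokenizeNLs=false"',
            '-maxAdditionalKnownLCWords', '0']

cmd = build_ner_cmd('english.all.3class.caseless.distsim.crf.ser.gz', 'input.txt')
print(' '.join(cmd))
```

In a `StanfordNERTagger` subclass, a list like this could be returned from an overridden `_cmd` property; on the Java side, the same flag can be passed directly on the CRFClassifier command line or set in the properties file.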

Hartung answered 6/1, 2016 at 7:10

Comments (3):

- May I know how to set maxAdditionalKnownLCWords? – Partheniaparthenocarpy
- @Gabor Can you explain a bit on this? – Azoic
- @Gabor Can you please help me to set maxAdditionalKnownLCWords? – Westonwestover