How to recognize a named entity that is lowercase, such as kobe bryant, with CoreNLP?

CoreNLP recognizes a named entity such as Kobe Bryant only when it begins with an uppercase letter; it does not recognize kobe bryant as a person. How can I get CoreNLP to recognize named entities that begin with a lowercase letter? Thanks!

Olpe answered 14/7, 2017 at 7:46 Comment(0)

First off, you do have to accept that it is harder to get named entities right in lowercase or inconsistently cased English text than in formal text, where capital letters are a great clue. (This is also one reason why Chinese NER is harder than English NER.) Nevertheless, there are things that you must do to get CoreNLP working fairly well with lowercase text – the default models are trained to work well on well-edited text.

If you are working with properly edited text, you should use our default English models. If the text that you are working with is (mainly) lowercase or uppercase, then you should use one of the two solutions presented below. If it's a real mixture (like much social media text), you might use the truecaser solution below, or you might gain by using both the cased and caseless NER models (as a long list of models given to the ner.model property).

Approach 1: Caseless models. We also provide English models that ignore case information. They will work much better on all lowercase text.

Approach 2: Use the truecaser. We provide a truecase annotator, which attempts to convert text into formally edited capitalization. You can apply it first, and then use the regular annotators.

In general, it's not clear to us that one of these approaches usually or always wins. You can try both.

Important: To have available the extra components invoked below, you need to have downloaded the English models jar, and to have it available on your classpath.

Here's an example. We start with a sample text:

% cat lakers.txt
lonzo ball talked about kobe bryant after the lakers game.

With the default models, no entities are found and all their words just get a common noun tag. Sad!

% java edu.stanford.nlp.pipeline.StanfordCoreNLP -file lakers.txt -outputFormat conll -annotators tokenize,ssplit,pos,lemma,ner
% cat lakers.txt.conll 
1   lonzo   lonzo   NN  O   _   _
2   ball    ball    NN  O   _   _
3   talked  talk    VBD O   _   _
4   about   about   IN  O   _   _
5   kobe    kobe    NN  O   _   _
6   bryant  bryant  NN  O   _   _
7   after   after   IN  O   _   _
8   the the DT  O   _   _
9   lakers  laker   NNS O   _   _
10  game    game    NN  O   _   _
11  .   .   .   O   _   _

Below, we ask to use the caseless models, and then we're doing pretty well: All the name words are now recognized as proper nouns, and the two person names are recognized. But the team name is still missed.

% java edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat conll -annotators tokenize,ssplit,pos,lemma,ner -file lakers.txt -pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger -ner.model edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.crf.ser.gz
% cat lakers.txt.conll 
1   lonzo   lonzo   NNP PERSON  _   _
2   ball    ball    NNP PERSON  _   _
3   talked  talk    VBD O   _   _
4   about   about   IN  O   _   _
5   kobe    kobe    NNP PERSON  _   _
6   bryant  bryant  NNP PERSON  _   _
7   after   after   IN  O   _   _
8   the the DT  O   _   _
9   lakers  lakers  NNPS    O   _   _
10  game    game    NN  O   _   _
11  .   .   .   O   _   _
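
The same caseless setup can probably be expressed from Java rather than on the command line. The following is a minimal sketch, assuming each command-line flag maps to a property of the same name and that the result would be handed to the `StanfordCoreNLP(Properties)` constructor. The class name `CaselessConfig` is hypothetical, and the sketch only builds the properties, so it runs without the CoreNLP jars:

```java
import java.util.Properties;

// Hypothetical helper (not part of CoreNLP): collects the caseless settings
// from the command line above into a java.util.Properties object.
class CaselessConfig {
    // Model paths copied from the command above; they resolve only when the
    // English models jar is on the classpath.
    static final String POS_MODEL =
        "edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger";
    static final String NER_MODELS = String.join(",",
        "edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz",
        "edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz",
        "edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.crf.ser.gz");

    static Properties build() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        props.setProperty("pos.model", POS_MODEL);
        props.setProperty("ner.model", NER_MODELS);
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        // With CoreNLP on the classpath, these properties would be passed to
        // new edu.stanford.nlp.pipeline.StanfordCoreNLP(props).
        props.stringPropertyNames().stream().sorted()
            .forEach(k -> System.out.println(k + " = " + props.getProperty(k)));
    }
}
```

For mixed-case text, the cased model paths could be appended to the same comma-separated `ner.model` value.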

Instead, you can run truecasing prior to POS tagging and NER:

% java edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat conll -annotators tokenize,ssplit,truecase,pos,lemma,ner -file lakers.txt -truecase.overwriteText
% cat lakers.txt.conll 
1   Lonzo   Lonzo   NNP PERSON  _   _
2   ball    ball    NN  O   _   _
3   talked  talk    VBD O   _   _
4   about   about   IN  O   _   _
5   Kobe    Kobe    NNP PERSON  _   _
6   Bryant  Bryant  NNP PERSON  _   _
7   after   after   IN  O   _   _
8   the the DT  O   _   _
9   Lakers  Lakers  NNPS    ORGANIZATION    _   _
10  game    game    NN  O   _   _
11  .   .   .   O   _   _

Now, the organization Lakers is recognized, and in general nearly all the entity words are tagged as proper nouns with the correct entity label, but it fails to get ball, which remains a common noun. Of course, this is a fairly hard word to get right in caseless text, since ball is a quite frequent common noun.
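
For a programmatic pipeline, the truecaser setup might be sketched as follows. This assumes the valueless `-truecase.overwriteText` flag corresponds to the property value `true`. The class `TruecaseConfig` and its `insertTruecase` helper are hypothetical names, and the sketch only prepares the properties, so it runs without the CoreNLP jars:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of CoreNLP): prepares properties for a
// pipeline that truecases the text before POS tagging and NER.
class TruecaseConfig {
    // Places "truecase" directly before "pos" so that tagging and NER see the
    // recased tokens. If "pos" is absent, "truecase" is appended at the end.
    static String insertTruecase(String annotators) {
        List<String> names =
            new ArrayList<>(Arrays.asList(annotators.trim().split("\\s*,\\s*")));
        if (names.contains("truecase")) {
            return String.join(",", names); // already present, leave order alone
        }
        int pos = names.indexOf("pos");
        names.add(pos >= 0 ? pos : names.size(), "truecase");
        return String.join(",", names);
    }

    static Properties build() {
        Properties props = new Properties();
        props.setProperty("annotators", insertTruecase("tokenize,ssplit,pos,lemma,ner"));
        // Assumed equivalent of the bare -truecase.overwriteText flag.
        props.setProperty("truecase.overwriteText", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        // With CoreNLP on the classpath, these properties would be passed to
        // new edu.stanford.nlp.pipeline.StanfordCoreNLP(props).
        System.out.println(props.getProperty("annotators"));
    }
}
```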

Postlude answered 15/7, 2017 at 20:38 Comment(2)
Seems like the truecaser has improved since. Rerunning this, it correctly capitalizes 'ball'. (Migration)
Yay, progress in NLP! (Postlude)

I have been working on an NER problem for a while, and it appears to me that using truecase from Stanford NLP is the better solution. However, there are still cases where truecase can't annotate a sentence correctly. Besides the example above, truecase seems to struggle with present-tense sentences. For instance,

"brenda elsey told sally jenkins about it."

truecase can recognize Brenda Elsey and Sally Jenkins.

If it is

brenda elsey tells Sally Jenkins about it.

It can only get Brenda and Sally Jenkins.

If it is

brenda elsey burns sally jenkins for it.

Then it gets Brenda and Burns Sally Jenkins.

Gundry answered 2/11, 2018 at 22:13 Comment(0)

You may be interested in this paper (accepted to EMNLP 2019): https://arxiv.org/abs/1903.11222

In this paper, we experiment with several different ways of dealing with this exact problem (including the 2 mentioned by @christopher-manning above). TLDR, the main takeaways are:

  1. Using a truecaser on test data is a bad idea, because truecasers perform more poorly than you think.
  2. Caseless models work pretty well.
  3. But overall the best option is to augment the original training data with caseless training data (just train_data.lower()) and retrain the model.
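
Point 3 could be sketched roughly as below. This is only an illustration, not the paper's code: it assumes the training data is a list of tab-separated `token<TAB>label` lines with blank lines between sentences, and it lowercases only the token column so that the labels stay intact. The class `CaselessAugment` and its methods are hypothetical names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: append a lowercased copy of the training data to the
// original, assuming one "token<TAB>label" pair per line.
class CaselessAugment {
    // Lowercases the first tab-separated field and keeps the rest unchanged.
    static String lowercaseToken(String line) {
        if (line.isEmpty()) {
            return line; // blank line marks a sentence boundary
        }
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return line.toLowerCase(Locale.ROOT); // token-only line
        }
        return line.substring(0, tab).toLowerCase(Locale.ROOT) + line.substring(tab);
    }

    // Returns the original lines followed by their lowercased copies.
    static List<String> augment(List<String> lines) {
        List<String> out = new ArrayList<>(lines);
        // Keep the two copies separated by a sentence boundary.
        if (!out.isEmpty() && !out.get(out.size() - 1).isEmpty()) {
            out.add("");
        }
        for (String line : lines) {
            out.add(lowercaseToken(line));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> train = Arrays.asList("Kobe\tPERSON", "Bryant\tPERSON", "");
        augment(train).forEach(System.out::println);
    }
}
```

The model would then be retrained on the augmented data; the retraining step itself depends on the NER toolkit and is not shown.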
Marela answered 16/10, 2019 at 2:6 Comment(0)
