I have a problem: CoreNLP only recognizes a named entity such as Kobe Bryant when it begins with an uppercase letter, and it fails to recognize kobe bryant as a person. How can I get CoreNLP to recognize named entities that begin with a lowercase letter? Thanks!
First off, you do have to accept that it is harder to get named entities right in lowercase or inconsistently cased English text than in formal text, where capital letters are a great clue. (This is also one reason why Chinese NER is harder than English NER.) Nevertheless, there are things that you must do to get CoreNLP working fairly well with lowercase text – the default models are trained to work well on well-edited text.
If you are working with properly edited text, you should use our default English models. If the text that you are working with is (mainly) lowercase or uppercase, then you should use one of the two solutions presented below. If it's a real mixture (like much social media text), you might use the truecaser solution below, or you might gain by using both the cased and caseless NER models (as a long list of models given to the ner.model property).
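For the mixed-case option, a minimal sketch of how the combined ner.model value could be assembled is shown here. The caseless paths are the ones used in the caseless command later in this answer. The cased paths are an assumption: the helper simply drops ".caseless" from each name, so check the result against the models jar you actually have. The function names are hypothetical, and the ordering of cased versus caseless models is a choice to experiment with, not a recommendation.

```python
# Caseless NER model paths, copied from the caseless command in this answer.
CASELESS_NER_MODELS = [
    "edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz",
    "edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz",
    "edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.crf.ser.gz",
]


def cased_counterpart(caseless_path):
    # ASSUMPTION: the cased model has the same file name without ".caseless".
    return caseless_path.replace(".caseless", "")


def combined_ner_model_value(caseless_models=CASELESS_NER_MODELS):
    """Build one comma-separated value listing cased models, then caseless ones."""
    cased_models = [cased_counterpart(path) for path in caseless_models]
    return ",".join(cased_models + list(caseless_models))


def ner_model_args():
    """Command-line arguments to append to a StanfordCoreNLP invocation."""
    return ["-ner.model", combined_ner_model_value()]


if __name__ == "__main__":
    print(" ".join(ner_model_args()))
```

The printed arguments can be appended to a `java edu.stanford.nlp.pipeline.StanfordCoreNLP ...` command line in place of a single-model `-ner.model` setting.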
Approach 1: Caseless models. We also provide English models that ignore case information. They will work much better on all lowercase text.
Approach 2: Use the truecaser. We provide a truecase annotator, which attempts to convert text into formally edited capitalization. You can apply it first, and then use the regular annotators.
In general, it's not clear to us that one of these approaches usually or always wins. You can try both.
Important: To have available the extra components invoked below, you need to have downloaded the English models jar, and to have it available on your classpath.
Here's an example. We start with a sample text:
% cat lakers.txt
lonzo ball talked about kobe bryant after the lakers game.
With the default models, no entities are found, and all the name words just get a common noun tag. Sad!
% java edu.stanford.nlp.pipeline.StanfordCoreNLP -file lakers.txt -outputFormat conll -annotators tokenize,ssplit,pos,lemma,ner
% cat lakers.txt.conll
1 lonzo lonzo NN O _ _
2 ball ball NN O _ _
3 talked talk VBD O _ _
4 about about IN O _ _
5 kobe kobe NN O _ _
6 bryant bryant NN O _ _
7 after after IN O _ _
8 the the DT O _ _
9 lakers laker NNS O _ _
10 game game NN O _ _
11 . . . O _ _
Below, we ask to use the caseless models, and then we're doing pretty well: All the name words are now recognized as proper nouns, and the two person names are recognized. But the team name is still missed.
% java edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat conll -annotators tokenize,ssplit,pos,lemma,ner -file lakers.txt -pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger -ner.model edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.crf.ser.gz
% cat lakers.txt.conll
1 lonzo lonzo NNP PERSON _ _
2 ball ball NNP PERSON _ _
3 talked talk VBD O _ _
4 about about IN O _ _
5 kobe kobe NNP PERSON _ _
6 bryant bryant NNP PERSON _ _
7 after after IN O _ _
8 the the DT O _ _
9 lakers lakers NNPS O _ _
10 game game NN O _ _
11 . . . O _ _
Instead, you can run truecasing prior to POS tagging and NER:
% java edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat conll -annotators tokenize,ssplit,truecase,pos,lemma,ner -file lakers.txt -truecase.overwriteText
% cat lakers.txt.conll
1 Lonzo Lonzo NNP PERSON _ _
2 ball ball NN O _ _
3 talked talk VBD O _ _
4 about about IN O _ _
5 Kobe Kobe NNP PERSON _ _
6 Bryant Bryant NNP PERSON _ _
7 after after IN O _ _
8 the the DT O _ _
9 Lakers Lakers NNPS ORGANIZATION _ _
10 game game NN O _ _
11 . . . O _ _
Now, the organization Lakers is recognized, and in general nearly all the entity words are tagged as proper nouns with the correct entity label, but it fails to get ball, which remains a common noun. Of course, this is a fairly hard word to get right in caseless text, since ball is a quite frequent common noun.
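To compare the two approaches on more than one sentence, it can help to pull the entities out of the .conll files programmatically. The following is a minimal sketch, assuming the seven-column layout shown in the outputs above (index, word, lemma, POS tag, NER tag, and two unused columns). The function name is hypothetical. Because this output has no B-/I- prefixes, two adjacent entities with the same label will be merged into one span.

```python
def entities_from_conll(conll_text):
    """Group consecutive tokens that share a non-O NER tag into (text, tag) spans."""
    entities = []
    span_words, span_tag = [], "O"
    # The trailing empty line acts as a sentinel that closes any open span.
    for line in conll_text.splitlines() + [""]:
        fields = line.split()
        # Blank or short lines (e.g. sentence boundaries) are treated as "O".
        word, tag = (fields[1], fields[4]) if len(fields) >= 5 else (None, "O")
        if tag != span_tag:
            if span_tag != "O":
                entities.append((" ".join(span_words), span_tag))
            span_words, span_tag = [], tag
        if tag != "O":
            span_words.append(word)
    return entities


# The truecase output from above, used as sample input.
TRUECASE_OUTPUT = """\
1 Lonzo Lonzo NNP PERSON _ _
2 ball ball NN O _ _
3 talked talk VBD O _ _
4 about about IN O _ _
5 Kobe Kobe NNP PERSON _ _
6 Bryant Bryant NNP PERSON _ _
7 after after IN O _ _
8 the the DT O _ _
9 Lakers Lakers NNPS ORGANIZATION _ _
10 game game NN O _ _
11 . . . O _ _
"""

if __name__ == "__main__":
    for text, tag in entities_from_conll(TRUECASE_OUTPUT):
        print(tag, text)
```

Running the same function over the caseless output would show lonzo ball as a single PERSON span and no ORGANIZATION, which makes the trade-off between the two approaches easy to see side by side.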
I have been working on NER problems for a while, and it appears to me that using truecase from Stanford NLP is the better solution. However, there are still cases where truecase fails to annotate a sentence correctly. Besides the example above, truecase seems to struggle with present-tense sentences. For instance,
"brenda elsey told sally jenkins about it."
truecase can recognize Brenda Elsey and Sally Jenkins.
If it is
brenda elsey tells Sally Jenkins about it.
It can only get Brenda and Sally Jenkins.
If it is
brenda elsey burns sally jenkins for it.
Then it gets Brenda and Burns Sally Jenkins.
You may be interested in this paper (accepted to EMNLP 2019): https://arxiv.org/abs/1903.11222
In this paper, we experiment with several different ways of dealing with this exact problem (including the 2 mentioned by @christopher-manning above). TLDR, the main takeaways are:
- Using a truecaser on test data is a bad idea, because truecasers perform more poorly than you think.
- Caseless models work pretty well.
- But overall the best option is to augment the original training data with caseless training data (just train_data.lower()) and retrain the model.
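The augmentation step in the last takeaway can be sketched roughly as follows. This is one reading of `train_data.lower()`, not the paper's actual code: it assumes the training data is held as a list of sentences, each a list of (token, label) pairs, and the function name is hypothetical. Only the tokens are lowercased, and the labels are kept unchanged.

```python
def augment_with_lowercase(sentences):
    """Return the original sentences plus a lowercased copy of each one.

    `sentences` is a list of sentences, each a list of (token, label) pairs.
    This data layout is an assumption made for illustration.
    """
    augmented = [list(sentence) for sentence in sentences]
    for sentence in sentences:
        lowered = [(token.lower(), label) for token, label in sentence]
        # Skip sentences that were already all lowercase to avoid exact duplicates.
        if lowered != list(sentence):
            augmented.append(lowered)
    return augmented


if __name__ == "__main__":
    train_data = [
        [("Kobe", "PERSON"), ("Bryant", "PERSON"), ("scored", "O"), (".", "O")],
    ]
    for sentence in augment_with_lowercase(train_data):
        print(sentence)
```

The augmented data would then be written back out in whatever format the NER trainer expects, and the model retrained on it.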