Stanford NER toolkit - lowercase entities recognition
S

5

6

I am new to NLP and trying to figure out how a Named Entity Recognizer annotates named entities. I am experimenting with the Stanford NER toolkit. When I use the NER on standard, more formal datasets where all naming conventions are followed, such as newswire or news blogs, it annotates the entities correctly. However, when I run it on informal datasets such as Twitter, where named entities might not be capitalized as they should be, it does not annotate them. The classifier I am using is a serialized 3-class CRF classifier. Can anybody tell me how I can make the NER recognize lowercase entities too? Any suggestions on how to hack the NER, and where this improvement would need to be made, are greatly appreciated. Thanks in advance for all your help.

Silly answered 20/11, 2010 at 23:39 Comment(1)
Are you training on tagged tweets, or are you trying to use a pre-existing model that's probably already trained on newswire text?Supertax
A
5

I'm afraid there isn't an easy way to get the trained models we distribute to ignore case information at runtime. So, yes, they'll usually only label capitalized names. It would be possible to train a caseless model, which would work reasonably well, but not as well as a cased model does on cased text, since case is a big clue in English (though not in German, Chinese, Arabic, etc.).

Audly answered 15/12, 2010 at 0:9 Comment(2)
Revised answer: We're now distributing caseless models for several of our tools, which will work much better when run on uncased text. (Though not as well as running cased models on cased text, since capitalization does give useful information in English!) You can download them separately from here: nlp.stanford.edu/software/CRF-NER.shtml .Audly
And we now have a truecaser. You can now find a much more detailed answer to this question here: #45098007Audly
M
5

I know it is an old thread, but I hope this helps someone. As Christopher Manning replied, the way to get lowercase entities detected is to replace english.muc.7class.distsim.crf.ser.gz with english.muc.7class.caseless.distsim.crf.ser.gz, which you get when you unzip the CoreNLP caseless jar file.

For example, in my Python file I kept everything the same except for pointing to the new model file, and it works perfectly (well, most of the time):

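from nltk.tag.stanford import NERTagger  # assumes NLTK's wrapper; later NLTK releases name it StanfordNERTagger
st = NERTagger('/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/classifiers/english.muc.7class.caseless.distsim.crf.ser.gz', '/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/stanford-ner.jar')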
Mangosteen answered 12/12, 2014 at 14:30 Comment(0)
B
2

In addition to the other suggestions: if you're using a feature-based classifier, I would definitely add the 100-200 most common 3-4 letter substrings in people's names as features, or collect them into a gazetteer exposed as a single feature. Certain patterns show up quite a bit in personal names but not very often in other types of words, like "eli."
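A minimal sketch of the substring idea, assuming you already have a list of personal names to mine. The function names and the tiny name list are hypothetical, and how the resulting features get wired into a particular classifier is left open:

```python
from collections import Counter


def char_ngrams(word, sizes=(3, 4)):
    """Return all lowercase character substrings of the given lengths."""
    word = word.lower()
    return [word[i:i + n] for n in sizes for i in range(len(word) - n + 1)]


def top_name_substrings(names, k=200):
    """Pick the k substrings that occur in the largest number of names."""
    counts = Counter()
    for name in names:
        # Count each substring once per name, not once per occurrence.
        counts.update(set(char_ngrams(name)))
    return [gram for gram, _ in counts.most_common(k)]


def name_substring_features(token, common):
    """List the common name substrings that appear in a token."""
    return sorted(g for g in set(char_ngrams(token)) if g in common)


# Hypothetical toy name list; a real one would have thousands of entries.
names = ["Elizabeth", "Eli", "Elijah", "Elias"]
common = set(top_name_substrings(names, k=1))
features = name_substring_features("elina", common)
```

Because everything is lowercased before counting, the features fire the same way on "Elijah" and "elijah", which is the point for uncased tweets.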

Brianabriand answered 7/6, 2012 at 7:42 Comment(0)
L
1

I think Twitter is going to be very difficult for this application. Capital letters are a big clue which, as you say, are often missing on Twitter. A dictionary check to remove valid English words is of limited use because Twitter texts include a huge number of abbreviations and they're often unique.

Perhaps part-of-speech tagging and frequency analysis can both be used to help improve detection of proper nouns?

Lewak answered 20/11, 2010 at 23:58 Comment(2)
Thank you for the reply. What I am planning to do is use a new feature set that includes entities in both capital and lowercase letters to generate the Stanford NLP serialized model, and then use Stanford NER to annotate with it. I believed it should work, but after getting everything running, building the serialized model, and running Stanford NER with it, it labels all entities as PERSON, although I have only one entity annotated as PERS in the training data.Silly
POS tagging would have been better, I suppose, but I am already too far along with Stanford NER and am curious to get it working for lowercase too.Silly
C
1

The question is a bit old, but somebody else may be able to benefit from this idea.

One way to potentially train a classifier for lower case would be to run the upper case classifier that you already have against a large data set of proper English, then process that tagged text to remove case. Then you have a tagged corpus that you can use to train a new classifier. This new classifier won't be perfect against Twitter because of the peculiarities of tweets, but it's a quick way to bootstrap it.
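A rough sketch of the "remove case" step, assuming the output of the cased classifier has already been collected as lists of (token, label) pairs. The helper names and the example sentence are hypothetical, and the one-token-per-line, tab-separated layout with a blank line between sentences is an assumption about the training file format you would feed to the trainer:

```python
def decase_sentences(tagged_sentences):
    """Lowercase every token while keeping its label unchanged."""
    return [
        [(token.lower(), label) for token, label in sentence]
        for sentence in tagged_sentences
    ]


def to_training_tsv(tagged_sentences):
    """Render sentences as token<TAB>label lines, blank line after each sentence."""
    lines = []
    for sentence in tagged_sentences:
        lines.extend(f"{token}\t{label}" for token, label in sentence)
        lines.append("")  # sentence boundary
    return "\n".join(lines)


# Hypothetical output from the cased classifier on well-formed text.
tagged = [[("Barack", "PERSON"), ("Obama", "PERSON"),
           ("visited", "O"), ("Paris", "LOCATION")]]
training_text = to_training_tsv(decase_sentences(tagged))
```

The labels are only as good as the cased classifier that produced them, so any errors it makes carry over into the new training set.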

Curkell answered 15/6, 2012 at 19:35 Comment(0)
