Is there any NER model that recognizes first and last names instead of just PERSON?

Given a set of strings like:

"John Doe"
"Doe John"
"Albert Green"
"Greenshpan David"

...

I would like to run an NER model that recognizes the first name and the last name separately. All the English models I have tried (in spaCy, NLTK, etc.) only give me a PERSON entity.

Is there a model that is already trained for this?

Desired output:

{"John": "First Name", "Doe": "Last Name"}
{"Doe": "Last Name", "John": "First Name"}
{"Albert": "First Name", "Green": "Last Name"}
{"Greenshpan": "Last Name", "David": "First Name"}
Lordship answered 13/3, 2022 at 21:43

As far as I know, pretty much all the important NER datasets these models were trained on do not distinguish between first and last names. I would guess that in normal, full-sentence language the pattern last name first name is quite rare, and that most of the time it is mainly the context that determines which comes first. In normal written and spoken sentences the first name is pretty much always going to come first. In some list formats and specific databases it might be the other way around, although then the two parts are usually separated by a ,.

This distinction is also inherently difficult and ambiguous (even more so than NER already is), since there are cases like David Paul / Paul David where even a human annotator could not tell which is which.

So what you could do is either:

  1. Handle this problem with rules, e.g. in one of the following ways:
    • If there is a , in the entity, assume it's lastname firstname, otherwise firstname lastname.
    • If the sentence containing the name is complete and grammatically correct, assume it's firstname lastname, otherwise lastname firstname. For this you can use spaCy's sentence segmentation to split the text into sentences and then feed the sentences to any model trained on the CoLA (Corpus of Linguistic Acceptability) task, see for example this demo: sample 1 (correct), sample 2 (incorrect).
    • Create a dataset of probable first names and last names from your own or any large corpus, e.g. by extracting entities and treating everything after Mr., Dr. etc. as a probable last name and entities that consist of a single word as probable first names. There may also be databases of popular first and last names. Use the collected dataset to check whether a part of a name entity occurs more often as a first name or as a last name. If it's unknown, assume the longer part is the last name.
  2. Train/fine-tune a model as a token classification task, either by annotating data yourself or by first collecting probable first and last names as described above and then annotating data automatically (and optionally post-correcting it). The best way to go about this would probably be to fine-tune a transformer model like BERT or RoBERTa. They perform strongly on NER and would likely also perform quite well on a modified, more fine-grained version of it. Check out this course on how to fine-tune such a model.
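
To make the rule-based option a bit more concrete, here is a minimal Python sketch that combines the comma rule, the frequency lookup and the "longer part is the last name" fallback. Everything named here is an assumption for illustration: `FIRST_NAME_COUNTS` and `LAST_NAME_COUNTS` are made-up toy counts (in practice they would come from your corpus or a name database), `first_name_score` and `label_name` are hypothetical helpers, and treating an unseen token as a neutral 0.5 is just one possible choice. It also only handles names with exactly two parts.

```python
from collections import Counter

# Hypothetical toy counts of how often a token was seen as a first or last name.
# In practice these would be collected from a corpus or a name database.
FIRST_NAME_COUNTS = Counter({"john": 120, "albert": 40, "david": 95, "paul": 60})
LAST_NAME_COUNTS = Counter({"doe": 30, "green": 55, "david": 10, "paul": 8})


def first_name_score(token):
    """Share of sightings of `token` as a first name; 0.5 if never seen (assumption)."""
    key = token.lower()
    first, last = FIRST_NAME_COUNTS[key], LAST_NAME_COUNTS[key]
    total = first + last
    return first / total if total else 0.5


def label_name(entity):
    """Label the two parts of a PERSON entity as 'First Name' / 'Last Name'."""
    # Rule 1: a comma suggests the "Lastname, Firstname" format.
    if "," in entity:
        last, first = (part.strip() for part in entity.split(",", 1))
        return {last: "Last Name", first: "First Name"}

    parts = entity.split()
    if len(parts) != 2:
        raise ValueError("this sketch only handles two-part names")
    a, b = parts

    # Rule 2: the part seen more often as a first name is the first name.
    score_a, score_b = first_name_score(a), first_name_score(b)
    if score_a != score_b:
        a_is_first = score_a > score_b
    else:
        # Rule 3: no evidence either way, so assume the longer part is the
        # last name (equal lengths fall back to firstname lastname order).
        a_is_first = len(a) <= len(b)

    labels = ("First Name", "Last Name") if a_is_first else ("Last Name", "First Name")
    return dict(zip((a, b), labels))


if __name__ == "__main__":
    for name in ["John Doe", "Doe John", "Albert Green", "Greenshpan David", "Doe, John"]:
        print(label_name(name))
```

With these toy counts, "Greenshpan David" is resolved because David is known mostly as a first name even though Greenshpan is unseen. A pair like David Paul is still decided only by a small difference in counts, which is exactly the ambiguity described above, so the output for such cases should not be trusted much.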
Castled answered 13/3, 2022 at 22:46
