The issue with model accuracy
The problem with all models is that none of them is 100% accurate, and even using a bigger model doesn't help to recognize dates. Here are the accuracy values (F-score, precision, recall) for the NER models; they are all around 86%.
document_string = """
Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST
The patient was referred by Dr. Jacob Austin.
Electronically signed by Robert Clowson, M.D.; Janury 15 2015 11:13AM CST
Electronically signed by Dr. John Douglas, M.D.; Jun 16 2017 11:13AM CST
The patient was referred by
Dr. Jayden Green Olivia.
"""
With the small model, two date items are labelled as 'PERSON':
import spacy
nlp = spacy.load('en')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
# [Wes Scott,
# Jun 26,
# Jacob Austin,
# Robert Clowson,
# John Douglas,
# Jun 16 2017,
# Jayden Green Olivia]
With the larger model en_core_web_md, the results are even worse in terms of precision, as there are three misclassified entities.
nlp = spacy.load('en_core_web_md')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
# [Wes Scott,
# Jun 26,
# Jacob Austin,
# Robert Clowson,
# Janury,
# John Douglas,
# Jun 16 2017,
# Jayden Green Olivia]
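To put numbers on that comparison, here is a small sketch that computes precision and recall for the two outputs above. It assumes a gold list of person names read off the sample document by hand and counts a prediction as correct only on an exact string match; `precision_recall` is a hypothetical helper, not part of spaCy.

```python
# Gold-standard person names, read off the sample document by hand (assumption).
gold = {"Wes Scott", "Jacob Austin", "Robert Clowson",
        "John Douglas", "Jayden Green Olivia"}

# PERSON predictions copied from the two outputs above.
small_model = ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
               "John Douglas", "Jun 16 2017", "Jayden Green Olivia"]
medium_model = ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
                "Janury", "John Douglas", "Jun 16 2017", "Jayden Green Olivia"]


def precision_recall(predicted, gold_names):
    """Return (precision, recall) using exact string matches."""
    true_positives = sum(1 for name in predicted if name in gold_names)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold_names)
    return precision, recall


for model_name, predicted in (("small", small_model), ("medium", medium_model)):
    p, r = precision_recall(predicted, gold)
    print(f"{model_name}: precision={p:.3f} recall={r:.3f}")
# small: precision=0.714 recall=1.000
# medium: precision=0.625 recall=1.000
```

On this sample both models find every person (recall 1.0), so the difference is entirely in precision: 5 of 7 predictions are correct for the small model versus 5 of 8 for the larger one.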
I also tried other models (xx_ent_wiki_sm, en_core_web_md) and they don't bring any improvement either.
What about using rules to improve accuracy?
In this small example, not only does the document have a clear structure, but the misclassified entities are all dates. So why not combine the initial model with a rule-based component?
The good news is that in spaCy:
it's possible to combine statistical and rule-based components in a
variety of ways. Rule-based components can be used to improve the
accuracy of statistical models.
So, by following the example and using the dateparser library (a parser for human-readable dates), I've put together a rule-based component that works very well on this example:
from spacy.tokens import Span
import dateparser
def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            # Only check for a title if the entity is not the first token
            prev_token = doc[ent.start - 1] if ent.start != 0 else None
            # if person preceded by title, include title in entity
            if prev_token is not None and prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
                new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
                new_ents.append(new_ent)
            # if entity can be parsed as a date, it's not a person
            elif dateparser.parse(ent.text) is None:
                new_ents.append(ent)
        else:
            # keep all non-person entities unchanged
            new_ents.append(ent)
    doc.ents = new_ents
    return doc
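# Add the component after the named entity recognizer
# (spaCy 2.x API: add_pipe takes the function itself; spaCy 3.x expects a
# component registered with @Language.component and added by its name)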
# nlp.remove_pipe('expand_person_entities')
nlp.add_pipe(expand_person_entities, after='ner')
doc = nlp(document_string)
[(ent.text, ent.label_) for ent in doc.ents if ent.label_=='PERSON']
# Out:
# [('Wes Scott', 'PERSON'),
# ('Dr. Jacob Austin', 'PERSON'),
# ('Robert Clowson', 'PERSON'),
# ('Dr. John Douglas', 'PERSON'),
# ('Dr. Jayden Green Olivia', 'PERSON')]
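The date rule used in the component can also be tried in isolation, without spaCy or dateparser. The sketch below is only a rough stand-in: `looks_like_date` is a hypothetical helper built on the standard library's `datetime.strptime`, and it only understands the "Mon DD" and "Mon DD YYYY" shapes seen in this example, whereas dateparser accepts far more formats.

```python
from datetime import datetime

# Formats this sketch understands (assumption: English month names only).
DATE_FORMATS = ("%b %d %Y", "%B %d %Y")


def looks_like_date(text):
    """Return True if text parses as 'Mon DD' or 'Mon DD YYYY'."""
    # Try the text as-is, then with a dummy leap year appended so that
    # year-less strings such as "Jun 26" can be parsed as well.
    for candidate in (text, f"{text} 2000"):
        for fmt in DATE_FORMATS:
            try:
                datetime.strptime(candidate, fmt)
                return True
            except ValueError:
                continue
    return False


# PERSON predictions of the small model, copied from the first output above.
predicted_persons = ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
                     "John Douglas", "Jun 16 2017", "Jayden Green Olivia"]

# Keep only the predictions that do not parse as a date.
kept = [name for name in predicted_persons if not looks_like_date(name)]
print(kept)
# ['Wes Scott', 'Jacob Austin', 'Robert Clowson', 'John Douglas', 'Jayden Green Olivia']
```

Note that this stand-in would not catch a misspelled month on its own, such as "Janury", since it contains neither a valid month name nor a day.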