If you are using one of the standard English models (en_core_web_sm, en_core_web_md, or en_core_web_lg), the most common abbreviations, like that one, should already be handled:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> doc = nlp('It starts at 9:00 a.m. Eastern Standard Time.')
>>> list(doc.sents)
[It starts at 9:00 a.m. Eastern Standard Time.]
However, if you have an abbreviation that the model you are using does not recognize, you can use add_special_case to handle it properly. For example, in the following case Pres. is not recognized as an abbreviation, so two sentences are returned instead of one:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> doc = nlp('I want to talk to Pres. Michael')
>>> list(doc.sents)
[I want to talk to Pres., Michael]
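You would have to register your own set of special cases to tell the tokenizer that this is an abbreviation and not the end of a sentence. In older spaCy releases, which the call here relies on, the verbatim text of the token (ORTH) can be whatever you want, and may also include the dot. spaCy v3 is stricter: the ORTH values of a special case must concatenate back to the original string, and only ORTH and NORM may be set.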
>>> from spacy.attrs import ORTH, LEMMA
>>> nlp.tokenizer.add_special_case('Pres.', [{ORTH: 'Pres', LEMMA: 'president'}])
>>> doc = nlp('I want to talk to Pres. Michael')
>>> list(doc.sents)
[I want to talk to Pres Michael]
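For spaCy v3, the same idea can be sketched as follows. This is a minimal example under stated assumptions, not the setup used above: it assumes spaCy v3 is installed, and it uses a blank English pipeline with the rule-based sentencizer component instead of en_core_web_sm, so that no model download is needed and the splitting behaviour is deterministic. The special case keeps the dot in ORTH and stores the expanded form in NORM, since v3 does not accept LEMMA here.

```python
import spacy
from spacy.attrs import ORTH, NORM

# Assumption: spaCy v3. A blank pipeline contains only the tokenizer.
nlp = spacy.blank("en")
# Rule-based sentence splitter: starts a new sentence after ".", "!" or "?" tokens.
nlp.add_pipe("sentencizer")

text = "I want to talk to Pres. Michael"

# Without a special case, "Pres." is tokenized as "Pres" + ".",
# and the standalone "." token triggers a sentence break.
before = [sent.text for sent in nlp(text).sents]

# Register "Pres." as a single token. ORTH reproduces the original string
# exactly (required in v3); NORM holds the expanded form.
nlp.tokenizer.add_special_case("Pres.", [{ORTH: "Pres.", NORM: "president"}])

doc = nlp(text)
after = [sent.text for sent in doc.sents]

print(before)
print(after)
print(doc[5].text, doc[5].norm_)
```

With a statistical model such as en_core_web_sm the sentence boundaries come from the parser rather than from punctuation rules, so keeping Pres. as one token makes a spurious split less likely but does not strictly guarantee a single sentence.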