spaCy: split sentences with abbreviations
spaCy splits a sentence incorrectly when it contains dots used for abbreviations.

import spacy
tool = spacy.load('en')
x = tool('It starts at 9:00 a.m. Eastern Standard Time.')
list(x.sents)

produces two sentences instead of one. How do I do this correctly?

Aconcagua answered 29/12, 2018 at 9:30 Comment(1)
Which model are you using: sm, md, or lg? – Slime

If you are using the standard English models, en_core_web_sm or en_core_web_md or en_core_web_lg, the most common abbreviations like that one should be already handled:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> doc = nlp('It starts at 9:00 a.m. Eastern Standard Time.')
>>> list(doc.sents)
[It starts at 9:00 a.m. Eastern Standard Time.]

However, if you have an abbreviation that is not recognized by the model you are using, you can use add_special_case to handle it properly. For example, in the following case Pres. is not recognized as an abbreviation, so two sentences are returned instead of one:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> doc = nlp('I want to talk to Pres. Michael')
>>> list(doc.sents)
[I want to talk to Pres., Michael]

You would have to load your own library of special cases in order to inform the tokenizer that this is an abbreviation, not the end of a sentence. The verbatim text of the token (ORTH) can be whatever you want, and may also include the dot.

>>> from spacy.attrs import ORTH, LEMMA
>>> nlp.tokenizer.add_special_case('Pres.', [{ORTH: 'Pres', LEMMA: 'president'}])
>>> doc = nlp('I want to talk to Pres. Michael')
>>> list(doc.sents)
[I want to talk to Pres Michael]
Harr answered 2/10, 2019 at 18:5 Comment(0)

Following @augustomen's answer, an update for spaCy v3.5.

Loading the model from a shortcut such as 'en' is obsolete as of spaCy v3.0; use the full package name instead:

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp('I want to talk to Pres. Michael')
>>> list(doc.sents)
[I want to talk to Pres., Michael]

The tokenizer can no longer overwrite a token's lemma; use NORM instead:

>>> from spacy.attrs import ORTH, NORM
>>> nlp.tokenizer.add_special_case('Pres.', [{ORTH: 'Pres.', NORM: 'president'}])
>>> doc = nlp('I want to talk to Pres. Michael')
>>> list(doc.sents)
[I want to talk to Pres. Michael]
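
If you have a whole library of abbreviations, you can register them in a loop. A minimal sketch (the abbreviation dict here is made up for illustration), using spacy.blank with the rule-based sentencizer so no trained model download is needed:

```python
import spacy
from spacy.attrs import ORTH, NORM

# Blank English pipeline with the rule-based sentencizer: it splits on
# sentence-final punctuation tokens, so abbreviation special cases matter.
nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')

# Hypothetical abbreviation list; each special case keeps the dot inside a
# single token, so the sentencizer never sees a standalone '.' there.
abbreviations = {'Pres.': 'president', 'Gov.': 'governor', 'Sen.': 'senator'}
for abbr, norm in abbreviations.items():
    nlp.tokenizer.add_special_case(abbr, [{ORTH: abbr, NORM: norm}])

doc = nlp('I want to talk to Pres. Michael and Sen. Jones.')
print(len(list(doc.sents)))  # prints 1: the abbreviations no longer end sentences
```

Because each abbreviation becomes a single token whose text is not sentence-final punctuation, this works with the statistical sentence segmenter of the trained pipelines as well.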
Nerin answered 3/4, 2023 at 16:32 Comment(0)
