How can I prevent spacy's tokenizer from splitting a specific substring when tokenizing a string?

More specifically, I have this sentence:

Once unregistered, the folder went away from the shell.

which gets tokenized as [Once/unregistered/,/the/folder/went/away/from/the/she/ll/.] by spacy 1.6.0. I don't want the substring shell to be split into the two tokens she and ll.


Here is the code I use:

# To install spacy:
# sudo pip install spacy
# sudo python -m spacy.en.download parser # will take 0.5 GB

import spacy
nlp = spacy.load('en')

# https://spacy.io/docs/usage/processing-text
document = nlp(u'Once unregistered, the folder went away from the shell.')

for token in document:
    print('token.i: {2}\ttoken.idx: {0}\ttoken.pos: {3:10}token.text: {1}'
          .format(token.idx, token.text, token.i, token.pos_))

which outputs:

token.i: 0      token.idx: 0    token.pos: ADV       token.text: Once
token.i: 1      token.idx: 5    token.pos: ADJ       token.text: unregistered
token.i: 2      token.idx: 17   token.pos: PUNCT     token.text: ,
token.i: 3      token.idx: 19   token.pos: DET       token.text: the
token.i: 4      token.idx: 23   token.pos: NOUN      token.text: folder
token.i: 5      token.idx: 30   token.pos: VERB      token.text: went
token.i: 6      token.idx: 35   token.pos: ADV       token.text: away
token.i: 7      token.idx: 40   token.pos: ADP       token.text: from
token.i: 8      token.idx: 45   token.pos: DET       token.text: the
token.i: 9      token.idx: 49   token.pos: PRON      token.text: she
token.i: 10     token.idx: 52   token.pos: VERB      token.text: ll
token.i: 11     token.idx: 54   token.pos: PUNCT     token.text: .
Asked by Confessional, 26/1, 2017 at 3:26

spacy allows you to add exceptions to the tokenizer.

Adding an exception to prevent the string shell from being split by the tokenizer can be done with nlp.tokenizer.add_special_case as follows:

import spacy
from spacy.symbols import ORTH, LEMMA, POS
nlp = spacy.load('en')

nlp.tokenizer.add_special_case(u'shell', [
    {
        ORTH: u'shell',
        LEMMA: u'shell',
        POS: u'NOUN',
    },
])

# https://spacy.io/docs/usage/processing-text
document = nlp(u'Once unregistered, the folder went away from the shell.')

for token in document:
    print('token.i: {2}\ttoken.idx: {0}\ttoken.pos: {3:10}token.text: {1}'
          .format(token.idx, token.text, token.i, token.pos_))

which outputs:

token.i: 0      token.idx: 0    token.pos: ADV       token.text: Once
token.i: 1      token.idx: 5    token.pos: ADJ       token.text: unregistered
token.i: 2      token.idx: 17   token.pos: PUNCT     token.text: ,
token.i: 3      token.idx: 19   token.pos: DET       token.text: the
token.i: 4      token.idx: 23   token.pos: NOUN      token.text: folder
token.i: 5      token.idx: 30   token.pos: VERB      token.text: went
token.i: 6      token.idx: 35   token.pos: ADV       token.text: away
token.i: 7      token.idx: 40   token.pos: ADP       token.text: from
token.i: 8      token.idx: 45   token.pos: DET       token.text: the
token.i: 9      token.idx: 49   token.pos: NOUN      token.text: shell
token.i: 10     token.idx: 54   token.pos: PUNCT     token.text: .
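
Note that spaCy requires the ORTH values of a special case to add up exactly to the original string. The same mechanism can also split a string into several tokens; a minimal sketch along the lines of the gimme example from spaCy's documentation (not tested against 1.6.0):

from spacy.symbols import ORTH, LEMMA

# The ORTH values 'gim' + 'me' concatenate back to the original
# string 'gimme', which is what add_special_case requires.
nlp.tokenizer.add_special_case(u'gimme', [
    {ORTH: u'gim', LEMMA: u'give'},
    {ORTH: u'me'},
])

print([token.text for token in nlp(u'gimme the folder')])
# expected: ['gim', 'me', 'the', 'folder']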
Answered by Confessional, 26/1, 2017 at 3:43

Comments (7):
Any idea why adding a special case for a "[LOCATION]" string results in three tokens - "[", "LOCATION", "]"? Shouldn't it work for all strings? – Graffito
@Graffito Two untested ideas: 1. add_special_case ignores punctuation; 2. add_special_case takes a regular expression as argument. – Confessional
Spacy works fine. It appears I made a mistake while copy-pasting the tokenisation example from the spacy site. Cheers. – Graffito
Why is "shell" currently split by the tokenizer? – Underpainting
@user1712447 I don't know, I haven't looked at the code. Probably a side effect of she'll -> she ll. – Confessional
Sounds rather arbitrary and makes me wonder about other "initiatives" Spacy might be taking. – Underpainting
However, this doesn't solve other similar cases like "sell", "well", etc. Is there a more general fix? – Eldaelden
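
A sketch of a more general fix, untested here and resting on two assumptions: a newer spaCy with a v3-style model name, and the nlp.tokenizer.rules property exposing the special-case table as a writable dict. Recent spaCy releases already dropped these ambiguous apostrophe-less contraction forms from the exceptions, so the filter below may find nothing to remove there.

import spacy

nlp = spacy.load('en_core_web_sm')  # v3-style model name (assumption)

# Check whether a form is a tokenizer special case at all;
# prints None if no exception exists for it.
print(nlp.tokenizer.rules.get('shell'))

# Apostrophe-less contraction forms that collide with ordinary words
# (she'll -> shell, we'll -> well, he'll -> hell, ...); illustrative,
# not exhaustive.
ambiguous = {'shell', 'Shell', 'well', 'Well', 'hell', 'Hell', 'ill', 'Ill'}

# Rebuild the special-case table without the ambiguous entries.
nlp.tokenizer.rules = {
    orth: analysis
    for orth, analysis in nlp.tokenizer.rules.items()
    if orth not in ambiguous
}

print([token.text for token in nlp(u'Once unregistered, the folder went away from the shell.')])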
