spaCy - Tokenization of hyphenated words

Good day SO,

I am trying to post-process hyphenated words that the tokenizer splits into separate tokens when they should have been a single token. For example:

Sentence: "up-scaled"
Tokens: ['up', '-', 'scaled']
Expected: ['up-scaled']
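
For reference, the default pipeline reproduces this split (a minimal snippet, assuming the small English model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
# spaCy's default infix rules split on the intra-word hyphen
print([t.text for t in nlp("up-scaled")])
# ['up', '-', 'scaled']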

For now, my solution is to use the Matcher:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{'IS_ALPHA': True, 'IS_SPACE': False},
           {'ORTH': '-'},
           {'IS_ALPHA': True, 'IS_SPACE': False}]

matcher.add('HYPHENATED', None, pattern)

def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    #print(doc)
    return doc

nlp.add_pipe(quote_merger, first=True)  # add it right after the tokenizer
doc = nlp(text)

However, this causes an unwanted side effect, shown below:

Example 2:

Sentence: "I know I will be back - I had a very pleasant time"
Tokens: ['i', 'know', 'I', 'will', 'be', 'back - I', 'had', 'a', 'very', 'pleasant', 'time']
Expected: ['i', 'know', 'I', 'will', 'be', 'back', '-', 'I', 'had', 'a', 'very', 'pleasant', 'time']

Is there a way to merge only words joined by a hyphen with no spaces around it, so that a word like 'up-scaled' is combined into a single token, but '.. back - I ..' is left alone?
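
For illustration, a possible filter based on token.whitespace_ (the trailing whitespace spaCy stores on each token) might look like this sketch:

def hyphen_merger(doc):
    matched_spans = []
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        # 'up-scaled' has no whitespace inside the span, while
        # 'back - I' does, so the latter match is skipped
        if not any(token.whitespace_ for token in span[:-1]):
            matched_spans.append(span)
    for span in matched_spans:
        span.merge()
    return doc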

Thank you very much

EDIT: I have tried the solution posted: Why does spaCy not preserve intra-word-hyphens during tokenization like Stanford CoreNLP does?

However, I didn't use that solution because it resulted in wrong tokenization of words with apostrophes (') and numbers with decimals:

Sentence: "It's"
Tokens: ["I", "t's"]
Expected: ["It", "'s"]

Sentence: "1.50"
Tokens: ["1", ".", "50"]
Expected: ["1.50"]

That is why I used the Matcher instead of trying to edit the regex.

Thrice answered 25/9, 2019 at 20:16 Comment(5)
I have added additional details to the question; hopefully it provides more information and further distinguishes the issue. Is it possible to remove the duplicate tag? – Thrice
Did you try the code from the docs? – Juryrigged
I have used the examples from the link as well, and the example in Customizing spaCy's Tokenizer class, but I am unable to produce a regex that can handle all of the cases above. – Thrice
See pastebin.com/e6RmfjQA – Juryrigged
Thank you, this helped! But I realised that I have to handle all contractions with apostrophes again. I compiled a list of contractions and manually inserted them into the ruleset (see the sketch below). – Thrice
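
For reference, re-adding a contraction as a tokenizer special case might look like this (a minimal sketch using spaCy's add_special_case; "haven't" is just an illustrative entry):

from spacy.attrs import ORTH

# split "haven't" back into "have" + "n't"
nlp.tokenizer.add_special_case("haven't", [{ORTH: "have"}, {ORTH: "n't"}])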

The Matcher is not really the right tool for this. You should modify the tokenizer instead.

If you want to keep everything else as-is and only change the behavior for hyphens, you should modify the existing infix patterns while preserving all the other settings. The current English infix pattern definitions are here:

https://github.com/explosion/spaCy/blob/58533f01bf926546337ad2868abe7fc8f0a3b3ae/spacy/lang/punctuation.py#L37-L49

You can add new patterns without defining a custom tokenizer, but there's no way to remove a pattern without defining one. So, if you comment out the hyphen pattern and define a custom tokenizer:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, HYPHENS, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

def custom_tokenizer(nlp):
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r"(?<=[0-9])[+\-\*^](?=[0-9-])",
            r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        ]
    )

    infix_re = compile_infix_regex(infixes)

    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)


nlp = spacy.load("en")
nlp.tokenizer = custom_tokenizer(nlp)
print([t.text for t in nlp("It's 1.50, up-scaled haven't")])
# ['It', "'s", "'", '1.50', "'", ',', 'up-scaled', 'have', "n't"]

You do need to provide the current prefix/suffix/token_match settings when initializing the new Tokenizer to preserve the existing tokenizer behavior. See also (for German, but very similar): https://mcmap.net/q/825449/-is-it-possible-to-change-the-token-split-rules-for-a-spacy-tokenizer

Edited to add (since this does seem unnecessarily complicated and you really should be able to redefine the infix patterns without loading a whole new custom tokenizer):

If you have just loaded the model (for v2.1.8) and haven't called nlp() yet, you can also just replace the tokenizer's infix_finditer without creating a custom tokenizer:

nlp = spacy.load('en')
# reuse the infix_re compiled from the modified infix list above
nlp.tokenizer.infix_finditer = infix_re.finditer

There's a caching bug that should hopefully be fixed in v2.2, which will let this work correctly at any point rather than only with a newly loaded model. (The behavior is extremely confusing otherwise, which is why creating a custom tokenizer has been a better general-purpose recommendation for v2.1.8.)

Tamera answered 26/9, 2019 at 7:55 Comment(3)
Thank you for the insight, but for some reason it affects the apostrophes too. For example, 'Haven't' does not get tokenized into 'Have', 'n't'. I guess I can't catch all edge cases, but thanks for the help! – Thrice
Hmm, that's kind of unexpected. I'll see if I can figure out what's going on. – Tamera
Okay, I missed one of the parameters (rules) to Tokenizer, which is what handles the contractions and other special cases and exceptions. – Tamera

If nlp = spacy.load('en') throws an error, use nlp = spacy.load("en_core_web_sm") instead; newer spaCy versions no longer support the 'en' shortcut link, and the model has to be installed first with python -m spacy download en_core_web_sm.
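
For example, the model can also be fetched from within Python (a sketch, assuming it isn't installed yet):

import spacy
from spacy.cli import download

# download the pretrained English pipeline, then load it
download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")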

Actress answered 20/7, 2021 at 8:36 Comment(0)
