Why does spaCy not preserve intra-word hyphens during tokenization like Stanford CoreNLP does?
SpaCy Version: 2.0.11

Python Version: 3.6.5

OS: Ubuntu 16.04

My Sentence Samples:

Marketing-Representative- won't die in car accident.

or

Out-of-box implementation

Expected Tokens:

["Marketing-Representative", "-", "wo", "n't", "die", "in", "car", "accident", "."]

["Out-of-box", "implementation"]

spaCy Tokens (Default Tokenizer):

["Marketing", "-", "Representative-", "wo", "n't", "die", "in", "car", "accident", "."]

["Out", "-", "of", "-", "box", "implementation"]

I tried creating a custom tokenizer, but it doesn't handle all the edge cases that spaCy covers via tokenizer_exceptions (code below):

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex
import re

nlp = spacy.load('en')

# Keep spaCy's default prefix/suffix rules, but replace the infix rules
# with a single character class that no longer includes the hyphen.
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def custom_tokenizer(nlp):
    # NOTE: constructing a Tokenizer without the `rules` argument drops the
    # default tokenizer_exceptions, which is why "won't" splits badly below.
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("Marketing-Representative- won't die in car accident.")
for token in doc:
    print(token.text)

Output:

Marketing-Representative-
won
'
t
die
in
car
accident
.

Could someone guide me towards the appropriate way of doing this?

Changing the regexes above might do it, or some other method might work. I also tried spaCy's rule-based Matcher, but I wasn't able to create a rule that handles hyphens between more than two words (e.g. "out-of-box"), which would have let me merge the matched spans with span.merge().

Either way, I need words containing intra-word hyphens to become single tokens, the way Stanford CoreNLP handles them.
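
To make the merging route concrete, here is a minimal post-processing sketch of what I was attempting: it sidesteps the Matcher entirely, finds hyphenated runs with a plain regex over the raw text, and merges the aligned spans. This is an illustration only; it assumes spaCy 2.x, where Span.merge() is available (later versions use doc.retokenize() instead):

import re
import spacy

nlp = spacy.load('en')  # spaCy 2.x model shortcut
doc = nlp("Out-of-box implementation for a Marketing-Representative.")

# Find runs of two or more hyphen-joined words in the raw text.
# Character offsets stay stable across merges, so each span can be
# re-derived from its offsets just before merging it.
for m in re.finditer(r"\w+(?:-\w+)+", doc.text):
    span = doc.char_span(m.start(), m.end())
    if span is not None:  # None if the offsets don't align with token
        span.merge()      # boundaries, e.g. a glued trailing hyphen

print([t.text for t in doc])
# ['Out-of-box', 'implementation', 'for', 'a', 'Marketing-Representative', '.']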

Shortterm answered 12/9, 2018 at 11:16 Comment(0)

Although it's not documented on the spaCy usage site, it looks like we just need to add a regex for the *fix we are working with, in this case the infix.

Also, it appears we can extend nlp.Defaults.prefixes with custom regexes:

infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")

This will give you the desired result. There is no need to set the prefix and suffix defaults, since we are not working with those.

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load('en')

# Reuse the default prefix patterns as the infix set and add three custom
# patterns; none of these split on a bare hyphen between letters, so
# hyphenated compounds survive intact.
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
infix_re = compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)

s1 = "Marketing-Representative- won't die in car accident."
s2 = "Out-of-box implementation"

for s in (s1, s2):
    doc = nlp(s)
    print([token.text for token in doc])

Result

$ python3 /tmp/nlp.py
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']  
['Out-of-box', 'implementation']  

You may want to refine the added regexes to make them more robust for other kinds of tokens that come close to matching the patterns applied here.
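
For example, a quick sanity check over nearby cases (the sample strings are my own, not from the original answer):

# Sanity-check the custom tokenizer against inputs close to the added patterns.
tests = [
    "state-of-the-art model",  # multi-hyphen compounds should stay whole
    "3.5 inches",              # the r"[./]" infix also splits decimal points
    "read/write access",       # ... and slashes between letters
]
for text in tests:
    print([token.text for token in nlp(text)])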

Collide answered 18/9, 2018 at 6:34 Comment(4)
Thanks for your response. Your solution works well, although I'm still not able to fix the trailing hyphen in the token ("Marketing-Representative-") generated by the custom tokenizer. I'm working on it, though. – Shortterm
Why do the following: infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)") – Wapentake
And not just the following: infixes = nlp.Defaults.prefixes + (r"[-]~")? – Wapentake
What are the first and the last patterns for in infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")? – Wapentake

I also want to modify spaCy's tokenizer to match the semantics of CoreNLP more closely. Pasted below is what I came up with; it addresses the hyphen issue in this thread (including the trailing hyphens) plus some additional fixes. I had to copy the default infix expressions and modify them, but I was able to simply append a new suffix expression:


import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS

def initializeTokenizer(nlp):

    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r'(?<=[0-9])[+\-\*^](?=[0-9-])',
            r'(?<=[{al}{q}])\.(?=[{au}{q}])'.format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            # REMOVE: the default rule that splits on hyphens between
            # letters (HYPHENS is in spacy.lang.char_classes):
            # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            # EDIT: drop the split on slash between letters, add comma:
            # r'(?<=[{a}0-9])[:<>=/](?=[{a}])'.format(a=ALPHA),
            r'(?<=[{a}0-9])[:<>=,](?=[{a}])'.format(a=ALPHA),
            # ADD: ampersand as an infix character, except for the
            # dual-uppercase FOO&FOO variant:
            r'(?<=[{a}0-9])[&](?=[{al}0-9])'.format(a=ALPHA, al=ALPHA_LOWER),
            r'(?<=[{al}0-9])[&](?=[{a}0-9])'.format(a=ALPHA, al=ALPHA_LOWER),
        ]
    )

    # ADD: a suffix rule that splits a trailing hyphen off the token.
    custom_suffixes = [r'[-]']
    suffixes = tuple(list(nlp.Defaults.suffixes) + custom_suffixes)

    infix_re = spacy.util.compile_infix_regex(infixes)
    suffix_re = spacy.util.compile_suffix_regex(suffixes)

    nlp.tokenizer.suffix_search = suffix_re.search
    nlp.tokenizer.infix_finditer = infix_re.finditer
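
A quick usage sketch (my own addition; it assumes an English pipeline such as en_core_web_sm is installed, and a spaCy version where the tokenizer's suffix_search and infix_finditer attributes are writable, which the code above already requires):

import spacy

nlp = spacy.load('en_core_web_sm')  # any English pipeline should work here
initializeTokenizer(nlp)

print([t.text for t in nlp("Marketing-Representative- won't die in car accident.")])
# expected: ['Marketing-Representative', '-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']
print([t.text for t in nlp("Out-of-box implementation")])
# expected: ['Out-of-box', 'implementation']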
Gordie answered 30/12, 2020 at 17:1 Comment(0)
