How to add custom rules to the spaCy tokenizer to break HTML tags into single tokens?

I know there are a lot of resources out there for this problem, but I could not get spaCy to do exactly what I want.

I would like to add rules to my spaCy tokenizer so that HTML tags (such as <br/> etc...) in my text would be a single token.

I am currently using the "merge_noun_chunks" pipe, so I get tokens like this one:
"documentation<br/>The Observatory Safety System" (this is a single token)

I would like to add a rule so that this would get split into 3 tokens:
"documentation", "<br/>", "The Observatory Safety System"
I've looked up a lot of resources: here and here, but I couldn't get either to work in my case.

I have tried this (roughly, inside a function that builds and returns a custom tokenizer):

    import re
    from spacy.tokenizer import Tokenizer
    from spacy.util import compile_prefix_regex, compile_suffix_regex

    def custom_tokenizer(nlp):
        # intended to match empty tags such as <br/>
        infix_re = re.compile(r'''<[\w+]\/>''')
        prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
        suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

        return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                    suffix_search=suffix_re.search,
                                    infix_finditer=infix_re.finditer,
                                    token_match=None)

I am not sure I understand exactly what changing the infixes does. Should I also remove < from the prefixes, as suggested here?
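
If it helps, this is what I think removing < from the prefixes would look like (just a sketch; I'm not sure whether "<" appears as a literal entry in the defaults in every spaCy version):

    import spacy
    from spacy.util import compile_prefix_regex

    nlp = spacy.load("en_core_web_sm")  # example model

    # assumes "<" is present as a literal entry in the default prefixes
    prefixes = [p for p in nlp.Defaults.prefixes if p != "<"]
    nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search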

Cadi answered 26/6, 2020 at 18:42

One way to achieve this seems to involve making the tokenizer both

  1. break up tokens that contain a tag with no surrounding whitespace, and
  2. "lump" tag-like sequences as single tokens.

To split up tokens like the one in your example, you can modify the tokenizer infixes (in the manner described here):

infixes = nlp.Defaults.infixes + [r'([><])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
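
For intuition: with only the infix change in place (and no special cases yet), a tag still comes out in pieces, something like this (a minimal sketch using a blank English pipeline):

import spacy

nlp = spacy.blank("en")
infixes = nlp.Defaults.infixes + [r'([><])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

print([t.text for t in nlp("documentation<br/>The")])
# roughly: ['documentation', '<', 'br/', '>', 'The']

That is exactly why step 2 is needed: the special cases merge those pieces back into a single <br/> token.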

To ensure tags are regarded as single tokens, you can use "special cases" (see the tokenizer overview or the method docs). You would add special cases for opening, closing and empty tags, e.g.:

# open and close
for tagName in "html body i br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}>", [{ORTH: f"<{tagName}>"}])    
    nlp.tokenizer.add_special_case(f"</{tagName}>", [{ORTH: f"</{tagName}>"}])    

# empty
for tagName in "br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}/>", [{ORTH: f"<{tagName}/>"}])    

Taken together:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_trf")
infixes = nlp.Defaults.infixes + [r'([><])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

for tagName in "html body i br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}>", [{ORTH: f"<{tagName}>"}])    
    nlp.tokenizer.add_special_case(f"</{tagName}>", [{ORTH: f"</{tagName}>"}])    

for tagName in "br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}/>", [{ORTH: f"<{tagName}/>"}])    

This seems to yield the expected result. E.g., applying ...

text = """<body>documentation<br/>The Observatory <p> Safety </p> System</body>"""
print("Tokenized:")
for t in nlp(text):
    print(t)

... will print the tag in its entirety and on its own:

# ... snip
documentation
<br/>
The
# ... snip

I found the tokenizer's explain method quite helpful in this context. It gives you a breakdown of what was tokenized and why.
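
For example, something along these lines (assuming nlp is set up as above):

for rule, token_text in nlp.tokenizer.explain("documentation<br/>The"):
    print(rule, repr(token_text))
# each entry pairs the rule that fired (TOKEN, INFIX, SPECIAL-n, ...) with the resulting piece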

Coax answered 18/2, 2021 at 20:52
