I know there are a lot of resources out there for this problem, but I could not get spaCy to do exactly what I want.
I would like to add rules to my spaCy tokenizer so that HTML tags in my text (such as <br/> etc.) each become a single token.
I am currently using the "merge_noun_chunks" pipe, so I get tokens like this one:
"documentation<br/>The Observatory Safety System"
(this is a single token)
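For context, the pipeline is set up roughly like this (assuming spaCy v3; the model name is just an example):

import spacy

nlp = spacy.load("en_core_web_sm")  # example model, any English pipeline should behave the same
nlp.add_pipe("merge_noun_chunks")   # merges each noun chunk into a single token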
I would like to add a rule so that this gets split into three tokens:
"documentation", "<br/>", "The Observatory Safety System"
I've looked at a lot of resources: here, and also here. But I couldn't get any of that to work in my case.
I have tried this:
import re
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # match HTML tags like <br/> as infixes, so they get split off as their own tokens
    infix_re = re.compile(r'<\w+/>')
    # keep the default prefix and suffix rules unchanged
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)
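Then I swap it into the pipeline and test on the example above (the sample text is just an illustration):

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("documentation<br/>The Observatory Safety System")
print([token.text for token in doc])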
I am not sure I understand exactly what changing the infix patterns does. Should I also remove < from the prefixes, as suggested here?
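If that is the right direction, this is how I would try removing < from the prefixes (assuming < appears as its own entry in nlp.Defaults.prefixes):

prefixes = [p for p in nlp.Defaults.prefixes if p != "<"]
prefix_re = compile_prefix_regex(prefixes)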