Tokenizing an HTML document

I have an HTML document and I'd like to tokenize it using spaCy while keeping HTML tags as single tokens. Here's my code:

import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en', vectors=False, parser=False, entity=False)

nlp.tokenizer.add_special_case(u'<i>', [{ORTH: u'<i>'}])
nlp.tokenizer.add_special_case(u'</i>', [{ORTH: u'</i>'}])

doc = nlp('Hello, <i>world</i> !')

print([e.text for e in doc])

The output is:

['Hello', ',', '<', 'i', '>', 'world</i', '>', '!']

If I put spaces around the tags, like this:

doc = nlp('Hello, <i> world </i> !')

The output is as I want it:

['Hello', ',', '<i>', 'world', '</i>', '!']

but I'd like to avoid complicated pre-processing of the HTML.

Any idea how I can approach this?

Interplanetary answered 29/11, 2017 at 9:58 Comment(5)
Sorry for the question, but what is the purpose of this? Why do you need to do it? – Marley
It's data for a NER model I'm training. I'd like to keep tags such as i and b as features for the model. – Interplanetary
Did you check github.com/explosion/spaCy/issues/1061 ? – Parturifacient
Why don't you just use an existing HTML parser, like docs.python.org/3.6/library/html.parser.html ? – Loan
I need to tokenize the document, so an HTML parser on its own will not suffice. Following this lead, I can think of using the parser to replace tags with special tokens, and then tokenizing. Is that what you mean? – Interplanetary
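A minimal sketch of that pre-processing idea, using Python's built-in html.parser to pad tags with spaces before handing the text to spaCy (the class and function names PadTags / pad_html_tags are just illustrative, not part of any library):

from html.parser import HTMLParser

class PadTags(HTMLParser):
    """Rebuilds the document text while surrounding every tag with spaces."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # keep the original tag text (including attributes) as one chunk
        self.parts.append(' ' + self.get_starttag_text() + ' ')

    def handle_endtag(self, tag):
        self.parts.append(' </' + tag + '> ')

    def handle_data(self, data):
        self.parts.append(data)

def pad_html_tags(text):
    parser = PadTags()
    parser.feed(text)
    # collapse the doubled spaces introduced by the padding
    return ' '.join(''.join(parser.parts).split())

print(pad_html_tags('Hello, <i>world</i> !'))
# Hello, <i> world </i> !

The padded string then tokenizes the same way as the spaced example in the question, once the tags are registered as special cases.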

You need to create a custom Tokenizer.

Your custom Tokenizer will be exactly like spaCy's default tokenizer, except that the '<' and '>' symbols are removed from the prefixes and suffixes, and one new prefix rule and one new suffix rule are added for the tags.

Code:

import spacy
from spacy.tokens import Token
# Custom token attribute; not used below, but handy if you later want to mark tag tokens.
Token.set_extension('tag', default=False)

def create_custom_tokenizer(nlp):
    from spacy import util
    from spacy.tokenizer import Tokenizer
    from spacy.lang.tokenizer_exceptions import TOKEN_MATCH
    # add the whole tags as new prefix and suffix rules
    prefixes = nlp.Defaults.prefixes + ('^<i>',)
    suffixes = nlp.Defaults.suffixes + ('</i>$',)
    # remove the tag symbols from prefixes and suffixes so '<' and '>' are not split off
    prefixes = list(prefixes)
    prefixes.remove('<')
    prefixes = tuple(prefixes)
    suffixes = list(suffixes)
    suffixes.remove('>')
    suffixes = tuple(suffixes)
    infixes = nlp.Defaults.infixes
    rules = nlp.Defaults.tokenizer_exceptions
    token_match = TOKEN_MATCH
    prefix_search = util.compile_prefix_regex(prefixes).search
    suffix_search = util.compile_suffix_regex(suffixes).search
    infix_finditer = util.compile_infix_regex(infixes).finditer
    return Tokenizer(nlp.vocab, rules=rules,
                     prefix_search=prefix_search,
                     suffix_search=suffix_search,
                     infix_finditer=infix_finditer,
                     token_match=token_match)


nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = create_custom_tokenizer(nlp)
doc = nlp('Hello, <i>world</i> !')
print([e.text for e in doc])
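
With the custom tokenizer in place, this should print the tokenization the question asks for, i.e. ['Hello', ',', '<i>', 'world', '</i>', '!'] (assuming a spaCy 2.x model, which is what this answer was written against).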
Godewyn answered 1/10, 2018 at 16:23 Comment(0)

For the record, this seems to have become easier: with current versions of spaCy you no longer need a custom tokenizer. It suffices to 1. extend the infixes (to ensure tags are separated from adjacent words), and 2. add the tags as special cases:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_trf")

infixes = nlp.Defaults.infixes + [r'(<)']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
nlp.tokenizer.add_special_case("<i>", [{ORTH: "<i>"}])
nlp.tokenizer.add_special_case("</i>", [{ORTH: "</i>"}])

text = """Hello, <i>world</i> !"""

doc = nlp(text)
print([e.text for e in doc])

Prints:

['Hello', ',', '<i>', 'world', '</i>', '!']

(This is more or less a condensed version of https://mcmap.net/q/1625896/-how-to-add-custom-rules-to-spacy-tokenizer-to-break-down-html-in-single-tokens)
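If you need more tags than just i (the question also mentions b), a small loop keeps the special cases in one place. This is just a sketch reusing nlp and ORTH from the snippet above, with an illustrative tag list:

# register opening and closing forms of each tag as single-token special cases
for tag in ("i", "b"):
    nlp.tokenizer.add_special_case(f"<{tag}>", [{ORTH: f"<{tag}>"}])
    nlp.tokenizer.add_special_case(f"</{tag}>", [{ORTH: f"</{tag}>"}])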

Vergos answered 19/2, 2021 at 9:9 Comment(0)
