Is it possible to use spacy with already tokenized input?
I have a sentence that has already been tokenized into words, and I want to get the part-of-speech tag for each word. When I checked the spaCy documentation, I realized that it starts from the raw sentence. I don't want to do that, because spaCy might end up with a different tokenization. So, is it possible to use spaCy with a list of words (rather than a string)?

Here is an example of what I mean:

# I know that spaCy does the following successfully:
import spacy
nlp = spacy.load('en_core_web_sm')
raw_text = 'Hello, world.'
doc = nlp(raw_text)
for token in doc:
    print(token.pos_)

But I want to do something similar to the following:

import spacy
nlp = spacy.load('en_core_web_sm')
tokenized_text = ['Hello', ',', 'world', '.']
doc = nlp(tokenized_text)
for token in doc:
    print(token.pos_)

I know this doesn't work as written, but is it possible to do something similar?

Tamarind answered 3/12, 2018 at 13:17 Comment(2)
@Chirag Yes, but in that case does the pipeline still have access to the context, or does it produce the POS tag by looking at the word alone? – Tamarind
Seems like a duplicate of #48170045. – Extravaganza
You can do this by replacing spaCy's default tokenizer with your own:

nlp.tokenizer = custom_tokenizer

Where custom_tokenizer is a function taking raw text as input and returning a Doc object.

You did not specify how you got the list of tokens. If you already have a function that takes raw text and returns a list of tokens, just make a small change to it:

from spacy.tokens import Doc

def custom_tokenizer(text):
    tokens = []

    # your existing code to fill the list with tokens

    # replace this line:
    # return tokens

    # with this:
    return Doc(nlp.vocab, words=tokens)

See the documentation on Doc.
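For illustration, here is a minimal sketch of such a function, assuming (hypothetically) that your input can simply be split on whitespace; your real tokenization logic would go in its place:

from spacy.tokens import Doc

def custom_tokenizer(text):
    # Hypothetical stand-in for your own tokenization logic:
    words = text.split()
    return Doc(nlp.vocab, words=words)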

If for some reason you cannot do this (maybe you don't have access to the tokenization function), you can use a dictionary:

tokens_dict = {'Hello, world.': ['Hello', ',', 'world', '.']}

def custom_tokenizer(text):
    if text in tokens_dict:
        return Doc(nlp.vocab, words=tokens_dict[text])
    else:
        raise ValueError('No tokenization available for input.')

Either way, you can then use the pipeline as in your first example:

doc = nlp('Hello, world.')
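Putting the pieces together, a complete runnable sketch of the dictionary approach, using the model and sentence from the question:

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

tokens_dict = {'Hello, world.': ['Hello', ',', 'world', '.']}

def custom_tokenizer(text):
    if text in tokens_dict:
        return Doc(nlp.vocab, words=tokens_dict[text])
    else:
        raise ValueError('No tokenization available for input.')

# Replace the default tokenizer; the rest of the pipeline still runs.
nlp.tokenizer = custom_tokenizer

doc = nlp('Hello, world.')
for token in doc:
    print(token.text, token.pos_)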
Gilreath answered 3/12, 2018 at 14:55 Comment(1)
Thanks, this is exactly what I was looking for. – Tamarind
Use the Doc object:

import spacy
from spacy.tokens import Doc
nlp = spacy.load("en_core_web_sm")

sents = [['Hello', ',', 'world', '.']]
for sent in sents:
    # Build a Doc from the pre-tokenized words, then run the
    # pipeline on it (passing a Doc to nlp() requires spaCy v3+).
    doc = Doc(nlp.vocab, words=sent)
    for token in nlp(doc):
        print(token.text, token.pos_)
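If you also need doc.text to reproduce the original string exactly, you can pass the spaces argument, which marks whether each word is followed by a space (the flags below are a sketch matching 'Hello, world.'):

doc = Doc(nlp.vocab, words=['Hello', ',', 'world', '.'],
          spaces=[False, True, False, False])
print(doc.text)  # Hello, world.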
Paestum answered 16/3, 2022 at 3:33 Comment(2)
Thanks! This is helpful in case my input is already tokenized by an unknown tokenizer. – Upper
Would this fail if a token is out of vocabulary? – Cassation
If the tokenized text is not constant (so the dictionary approach above won't work), another option is to skip the tokenizer and apply the pipeline components to the Doc yourself:

from spacy.tokens import Doc

# Apply each pipeline component (tagger, parser, ...) to the pre-built Doc.
spacy_doc = Doc(nlp.vocab, words=tokenized_text)
for name, proc in nlp.pipeline:
    spacy_doc = proc(spacy_doc)
Birch answered 19/5, 2021 at 5:07 Comment(0)
