Does spacy take as input a list of tokens?

I would like to use spacy's POS tagging, NER, and dependency parsing without using word tokenization. Indeed, my input is a list of tokens representing a sentence, and I would like to respect the user's tokenization. Is this possible at all, either with spacy or any other NLP package?

For now, I am using this spacy-based function to put a sentence (a unicode string) into CoNLL format:

import spacy
nlp = spacy.load('en')

def toConll(string_doc, nlp):
    doc = nlp(string_doc)
    block = []
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - doc[0].i + 1
        line = [str(i + 1), str(word), word.lemma_, word.tag_,
                word.ent_type_, str(head_idx), word.dep_]
        block.append(line)
    return block

conll_format = toConll(u"Donald Trump is the new president of the United States of America", nlp)

Output:
[['1', 'Donald', u'donald', u'NNP', u'PERSON', '2', u'compound'],
 ['2', 'Trump', u'trump', u'NNP', u'PERSON', '3', u'nsubj'],
 ['3', 'is', u'be', u'VBZ', u'', '0', u'ROOT'],
 ['4', 'the', u'the', u'DT', u'', '6', u'det'],
 ['5', 'new', u'new', u'JJ', u'', '6', u'amod'],
 ['6', 'president', u'president', u'NN', u'', '3', u'attr'],
 ['7', 'of', u'of', u'IN', u'', '6', u'prep'],
 ['8', 'the', u'the', u'DT', u'GPE', '10', u'det'],
 ['9', 'United', u'united', u'NNP', u'GPE', '10', u'compound'],
 ['10', 'States', u'states', u'NNP', u'GPE', '7', u'pobj'],
 ['11', 'of', u'of', u'IN', u'GPE', '10', u'prep'],
 ['12', 'America', u'america', u'NNP', u'GPE', '11', u'pobj']]

I would like to do the same, but with a list of tokens as input...

Grigson answered 9/1, 2018 at 13:43 Comment(1)
If I have a list of trailing whitespaces along with the list of tokens, I can reconstruct the sentence as a string and use the toConll function (see the sketch below). – Grigson
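A minimal sketch of that workaround (rebuild_sentence is an illustrative helper, not a spaCy API). Note that spaCy will still re-tokenise the rebuilt string, so the user's tokens are only preserved when spaCy happens to split on the same boundaries:

def rebuild_sentence(tokens, trailing_spaces):
    # join each token with a space only where its whitespace flag is True
    return "".join(tok + (" " if space else "")
                   for tok, space in zip(tokens, trailing_spaces))

tokens = [u"Donald", u"Trump", u"is", u"the", u"new", u"president"]
trailing_spaces = [True, True, True, True, True, False]
conll_format = toConll(rebuild_sentence(tokens, trailing_spaces), nlp)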

You can run Spacy's processing pipeline against already tokenised text. Be aware, though, that the underlying statistical models were trained on a reference corpus tokenised with a particular strategy, so if your tokenisation differs significantly from it, you can expect some performance degradation.

Here's how to go about it using Spacy 2.0.5 and Python 3. If using Python 2, you may need to use unicode literals.

import spacy
nlp = spacy.load('en_core_web_sm')
# spaces is a list of boolean values indicating if subsequent tokens
# are followed by any whitespace
# so, create a Spacy document with your tokenisation
doc = spacy.tokens.doc.Doc(
    nlp.vocab, words=['nuts', 'itch'], spaces=[True, False])
# run the standard pipeline against it
for name, proc in nlp.pipeline:
    doc = proc(doc)
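
To tie this back to the question, here is a minimal sketch (assuming spaCy 2.x and the en_core_web_sm model; the words and spaces lists are just example values) that builds a Doc from pre-tokenised input and then prints CoNLL-style rows with the same fields as the toConll function above:

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

# build a Doc directly from the user's tokens, bypassing spaCy's tokenizer
words = ['Donald', 'Trump', 'is', 'the', 'new', 'president']
spaces = [True, True, True, True, True, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# run the tagger, parser and NER over the pre-tokenised Doc
for name, proc in nlp.pipeline:
    doc = proc(doc)

# emit CoNLL-style rows, as in the question's toConll function
for i, word in enumerate(doc):
    head_idx = 0 if word.head.i == word.i else word.head.i + 1
    print([str(i + 1), word.text, word.lemma_, word.tag_,
           word.ent_type_, str(head_idx), word.dep_])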
Vastitude answered 10/1, 2018 at 12:6 Comment(2)
I used spacy.tokens.doc.Doc as you suggested in your code, and then ran the toConll loop starting from for i, word in enumerate(doc):, but spacy still does sentence segmentation based on the dependency parser: sometimes I get many roots. – Grigson
What if I already have PoS-tagged tokens? Can I feed the tags to spaCy somehow as well? There is nothing about it in the documentation. The thing is that I have German social media texts, which are poorly tokenized and tagged by spaCy. – Gatto

Yes, you just need to use nlp.tokenizer = nlp.tokenizer.tokens_from_list

import spacy

nlp = spacy.load("en_core_web_sm")
# replace the tokenizer so that nlp() accepts a list of tokens directly
nlp.tokenizer = nlp.tokenizer.tokens_from_list

text = "she went to school"
words = text.split()
doc = nlp(words)

# print one CoNLL-style row per token
for token in doc:
    token_i = token.i + 1
    if token.i == token.head.i:
        head_i = 0
    else:
        head_i = token.head.i + 1
    items = [token_i, token.text, token.lemma_, token.tag_, token.pos_,
             "_", head_i, token.dep_, "_", "_"]
    print(items)

Output:

[1, 'she', '-PRON-', 'PRP', 'PRON', '_', 2, 'nsubj', '_', '_']
[2, 'went', 'go', 'VBD', 'VERB', '_', 0, 'ROOT', '_', '_']
[3, 'to', 'to', 'IN', 'ADP', '_', 2, 'prep', '_', '_']
[4, 'school', 'school', 'NN', 'NOUN', '_', 3, 'pobj', '_', '_']
Alamein answered 28/2, 2021 at 16:41 Comment(0)
