SpaCy 3 Transformer Vector Token Alignment

I'm using spaCy 3.0.1 together with the transformer model (en_core_web_trf).
When I previously used spaCy transformers, it was possible to get the transformer vectors from a Token or Span. In spaCy 3, however, it seems that you can only access the transformer vectors via the Doc (doc._.trf_data), without a proper alignment to the spaCy tokens.

How can I get the alignment between spaCy Tokens and vectors/wordpieces?
Alternatively, is there a function that lets you directly get the vectors for a Token or Span?

Embroidery answered 11/2, 2021 at 7:31

Given a doc:

nlp = spacy.load("en_core_web_trf")  # requires `import spacy` and the installed model
doc = nlp("Helsinki is the capital of Finland.")

The wordpieces for this doc are:

[['<s>',
  'H',
  'els',
  'inki',
  'Ġis',
  'Ġthe',
  'Ġcapital',
  'Ġof',
  'ĠFinland',
  '.',
  '</s>']]

You can then access the alignment for a given token, for example the first one, with the following code:

# Get the first spaCy Token, "Helsinki", and its alignment data
doc[0], doc._.trf_data.align[0].data

Output:

(Helsinki,
 array([[1],
        [2],
        [3]], dtype=int32))

You can then use these indices to extract the corresponding wordpiece vectors from doc._.trf_data.tensors.
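
As a rough illustration of that last step, here is a minimal sketch that uses toy NumPy data instead of a real pipeline, so it runs without the model installed. It assumes that the first tensor has the layout (spans, wordpieces, width) and that the alignment indices point into the flattened wordpiece sequence. That is my understanding of spacy-transformers 1.x, but it may differ in other versions. The helper name token_vector and the choice to average the wordpiece vectors are illustrative, not part of spaCy.

```python
import numpy as np


def token_vector(tensor, wordpiece_indices):
    """Average the wordpiece vectors aligned to one token (hypothetical helper).

    tensor: array of shape (n_spans, n_wordpieces, width), standing in for
        doc._.trf_data.tensors[0] (assumed layout).
    wordpiece_indices: array shaped like doc._.trf_data.align[i].data, i.e.
        (k, 1), indexing into the flattened wordpiece sequence (assumption).
    """
    width = tensor.shape[-1]
    flat = tensor.reshape(-1, width)                 # (n_spans * n_wordpieces, width)
    idx = np.asarray(wordpiece_indices).reshape(-1)  # (k,)
    if idx.size == 0:
        # A token with no aligned wordpieces gets a zero vector here.
        return np.zeros(width, dtype=tensor.dtype)
    return flat[idx].mean(axis=0)


# Toy stand-in data: 1 span, 11 wordpieces, 4 dimensions per wordpiece.
tensor = np.arange(44, dtype=np.float32).reshape(1, 11, 4)

# Same indices as the alignment output shown above for "Helsinki".
helsinki_align = np.array([[1], [2], [3]], dtype=np.int32)

vec = token_vector(tensor, helsinki_align)
print(vec)
```

With a real doc, the call would presumably look like token_vector(doc._.trf_data.tensors[0], doc._.trf_data.align[0].data). For a Span, one option is to concatenate the alignment indices of its tokens before averaging.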


Source:

https://applied-language-technology.mooc.fi/html/notebooks/part_iii/04_embeddings.html

(It also provides more detailed explanations and information about using transformers in spaCy.)

Embroidery answered 11/2, 2021 at 7:53
