Transformers are a bit different from the other spaCy models, but you can use doc._.trf_data.tensors[1] for the document vector. The vectors for the individual BPE (Byte Pair Encoding) token-pieces are in doc._.trf_data.tensors[0]. Note that I use the term token-pieces rather than tokens, to avoid confusion between spaCy tokens and the tokens produced by the BPE tokenizer.
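As a minimal sketch of where these tensors live (assuming the en_core_web_trf pipeline, which uses a RoBERTa-style BPE tokenizer, and the example sentence used below):

import spacy

nlp = spacy.load("en_core_web_trf")  # assumption: any transformer pipeline works here
doc = nlp("The quick brown fox jumps over the lazy dog")

# tensors[0] holds one vector per token-piece: shape (1, n_pieces, hidden_width)
# tensors[1] holds one pooled vector for the whole text: shape (1, hidden_width)
print(doc._.trf_data.tensors[0].shape)
print(doc._.trf_data.tensors[1].shape)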
E.g., in our case the spacy-tokens are:
for i, spacy_tok in enumerate(doc):
    print(f"spacy-token {i + 1}: {spacy_tok.text}")
spacy-token 1: The
spacy-token 2: quick
spacy-token 3: brown
spacy-token 4: fox
spacy-token 5: jumps
spacy-token 6: over
spacy-token 7: the
spacy-token 8: lazy
spacy-token 9: dog
and the token-pieces are:
for i, tok_piece in enumerate(doc._.trf_data.tokens['input_texts'][0]):
    print(f"token-piece {i + 1}: {tok_piece}")
token-piece 1: <s>
token-piece 2: The
token-piece 3: Ġquick
token-piece 4: Ġbrown
token-piece 5: Ġfox
token-piece 6: Ġjumps
token-piece 7: Ġover
token-piece 8: Ġthe
token-piece 9: Ġlazy
token-piece 10: Ġdog
token-piece 11: </s>
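Since the two tokenizations differ (note the extra <s>/</s> markers and the Ġ word-boundary prefixes), a vector for a single spacy token has to be assembled from the vectors of the token-pieces it aligns to. A minimal sketch, assuming the alignment stored in doc._.trf_data.align and averaging as the pooling strategy:

trf = doc._.trf_data
# flatten the batch dimension: (1, n_pieces, hidden_width) -> (n_pieces, hidden_width)
pieces = trf.tensors[0].reshape(-1, trf.tensors[0].shape[-1])
for i, spacy_tok in enumerate(doc):
    # indices of the token-pieces aligned to this spacy token
    piece_idx = trf.align[i].data.ravel()
    # average the aligned token-piece vectors into one vector per spacy token
    vec = pieces[piece_idx].mean(axis=0)
    print(f"spacy-token {i + 1}: {spacy_tok.text}, vector shape: {vec.shape}")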