How to get token ids using spaCy (I want to map a text sentence to a sequence of integers)

I want to use spaCy to tokenize sentences and get a sequence of integer token ids that I can use for downstream tasks. I expect to use it something like the code below. Please fill in ???

import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_lg')

# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at ")

doc = nlp(text)

idxs = ???

print(idxs)

I want the output to be something like:

array([ 8045, 70727, 24304, 96127, 44091, 37596, 24524, 35224, 36253])

Preferably, the integers refer to some special embedding id in en_core_web_lg.

spacy.io/usage/vectors-similarity does not give a hint about which attribute of doc to look for.

I asked this on Cross Validated but it was closed as off-topic there. Proper terms for googling/describing this problem would also be helpful.

Heptameter answered 8/11, 2018 at 16:45 Comment(2)
Use any string-to-int hashing function, then do {word: hash(word) for word in words} to build the mapping you need. No need for spaCy. Let me know if you need this fleshed out. I assume you got it from here. – Masked
There may be many reasons not to write it myself. I want the identical pipeline mapping a chunk of text to spaCy word vectors. E.g., does the original tokenizer map many different chunks to the same token (e.g. by stemming or lowercasing)? I don't know. With 1M words in the vocab I bet there's also a highly optimized hash function. Anyway, I see many reasons to rely on an API rather than working around it. – Heptameter

Solution:

import spacy

nlp = spacy.load('en_core_web_md')
text = u"When Sebastian Thrun started working on self-driving cars at "

doc = nlp(text)

ids = []
for token in doc:
    if token.has_vector:
        # row index of the token's vector in nlp.vocab.vectors.data
        id = nlp.vocab.vectors.key2row[token.norm]
    else:
        # out-of-vocabulary token: no vector, hence no row id
        id = None
    ids.append(id)

print([token for token in doc])
print(ids)
#>> [When, Sebastian, Thrun, started, working, on, self, -, driving, cars, at]
#>> [71, 19994, None, 369, 422, 19, 587, 32, 1169, 1153, 41]
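
As an aside (not part of the original answer), the same lookup can be written more compactly with dict.get, which returns None when a token's norm hash has no row in the vector table; a minimal sketch:

# key2row is a plain dict, so .get() yields None for tokens whose norm
# hash has no vector row (i.e. out-of-vocabulary tokens)
ids = [nlp.vocab.vectors.key2row.get(token.norm) for token in doc]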

Breaking this down:

# A Vocab whose __getitem__ takes a piece of text and returns a Lexeme;
# the Lexeme's .norm attribute is a hash
nlp.vocab
# >>  <spacy.vocab.Vocab at 0x12bcdce48>
nlp.vocab['hello'].norm  # hash
# >> 5983625672228268878


# The array holding the word vectors
nlp.vocab.vectors.data.shape
# >> (20000, 300)

# A dict mapping hash -> row in this array
nlp.vocab.vectors.key2row
# >> {12646065887601541794: 0,
# >>  2593208677638477497: 1,
# >>  ...}

# So to get the int id (row) of 'earth':
i = nlp.vocab.vectors.key2row[nlp.vocab['earth'].norm]
nlp.vocab.vectors.data[i]

# Note that tokens have hashes but may not have a vector
# (hence no entry in .key2row)
nlp.vocab['Thrun'].has_vector
# >> False
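
To actually use these row ids downstream (as discussed in the comments below), here is a minimal sketch assuming PyTorch; the extra zero row for OOV tokens is my own convention, not something spaCy provides:

import numpy as np
import torch
import torch.nn as nn

# Copy spaCy's vector table and append one zero row to stand in for OOV tokens
vectors = nlp.vocab.vectors.data                      # shape (20000, 300)
oov_row = np.zeros((1, vectors.shape[1]), dtype=vectors.dtype)
weights = torch.tensor(np.vstack([vectors, oov_row]))

# Trainable embedding initialised from the pretrained vectors
embedding = nn.Embedding.from_pretrained(weights, freeze=False)

oov_id = vectors.shape[0]                             # index of the appended OOV row
row_ids = torch.tensor([i if i is not None else oov_id for i in ids])
token_vectors = embedding(row_ids)                    # shape (n_tokens, 300)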
Heptameter answered 9/11, 2018 at 9:49 Comment(4)
I am glad you have an answer, but I am unclear as to why an index int would give you a benefit over the hash int. Is it because you want to represent UNK tokens as a None type? – Similarity
Basically, I want spaCy to handle everything regarding mapping a word to a word vector. When I have this index, I can use it with torch.nn.EmbeddingBag to tune the (pretrained) embeddings. Many words can map to a single hash (if I understand correctly); in turn, many hashes may map to a single word vector (whose index is all I care about). – Heptameter
Oh yes, that makes sense. – Similarity
@Heptameter how come multiple words can have the same hash? If they have the same hash they would have the same vector, which is not what tokenization is supposed to do (except for OOV words). Am I getting something wrong? – Elastin

spaCy uses hashing on texts to get unique ids. All Token objects have multiple forms for different use cases of a given Token in a Document.

If you just want the normalised form of the Tokens, then use the .norm attribute, which is an integer (hashed) representation of the text:

>>> import spacy
>>> nlp = spacy.load('en')
>>> text = "here is some test text"
>>> doc = nlp(text)
>>> [token.norm for token in doc]
[411390626470654571, 3411606890003347522, 7000492816108906599, 1618900948208871284, 15099781594404091470]

You can also use other attributes, such as the lowercase integer attribute .lower, or many other things. Use help() on the Document or Token to get more information (a small round-trip example follows the help output below).

>>> help(doc[0])
Help on Token object:

class Token(builtins.object)
 |  An individual token – i.e. a word, punctuation symbol, whitespace,
 |  etc.
 |  
...
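
As a side note (my own addition, not from the original answer): as far as I know these hashes are reversible through the shared StringStore, so you can always map an id back to its text:

>>> nlp.vocab.strings[doc[0].norm]   # hash -> text via the StringStore
'here'
>>> nlp.vocab.strings['here']        # text -> hash works the same way
411390626470654571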
Similarity answered 8/11, 2018 at 17:38 Comment(1)
Thank you, this helped me. My understanding was from e.g. GloVe embeddings, where 1M tokens are represented as a [1e6, 384] tensor, so if I know the hash function mapping text to token to an integer in [0, 1e6-1], I can safely load the embeddings into a tensor and retrain them as I please using PyTorch, TensorFlow, etc. I was hoping spaCy would provide this hash function (pipeline). It does, and now I have found it; see the answer above. – Heptameter