I want to use spaCy to tokenize sentences into a sequence of integer token IDs that I can use for downstream tasks. I expect to use it something like below. Please fill in ???
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_lg')
# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at ")
doc = nlp(text)
idxs = ???
print(idxs)
I want the output to be something like:
array([ 8045, 70727, 24304, 96127, 44091, 37596, 24524, 35224, 36253])
Preferably the integers refer to some special embedding IDs in en_core_web_lg.
spacy.io/usage/vectors-similarity does not hint at which attribute of doc to look for.
I asked this on Cross Validated, but it was deemed off-topic there. Proper terms for googling/describing this problem would also be helpful.
A string-to-int mapping is a hashing function; then do {word: hash(word) for word in words} to build the mapping you need. No need for spaCy. Let me know if you need this to be fleshed out. I assume you got it from here. – Masked
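The comment's suggestion can be sketched in plain Python. This is an illustrative sketch, not spaCy's own mechanism: it uses hashlib instead of the built-in hash(), since hash() is randomized per process via PYTHONHASHSEED, and the bucket count is an arbitrary choice:

```python
import hashlib

def word_to_id(word, num_buckets=2**18):
    # Stable hash of the word; built-in hash() is not reproducible
    # across runs, so use a digest and reduce it modulo num_buckets.
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

words = "When Sebastian Thrun started working on self-driving cars".split()
idxs = [word_to_id(w) for w in words]
print(idxs)
```

Note that these IDs are not tied to the en_core_web_lg embedding table. If you do want spaCy's own integers, each token carries a hash-based ID as token.orth (with token.orth_ being the string form), so something like [t.orth for t in doc] yields integers of the kind shown in the expected output.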