spaCy - Confusion about word vectors and tok2vec
It would be really helpful if you could help me understand some underlying concepts about spaCy.

I understand that some spaCy models come with predefined static vectors; for the Spanish models, for example, these are the vectors generated by FastText. I also understand that there is a tok2vec layer that generates vectors from tokens, and that this is used, for example, as the input to the NER component of the model.
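
For concreteness, this is how I have been poking at the two kinds of vectors; a minimal sketch, assuming spaCy v2 and that the es_core_news_md package is installed:

```python
import spacy

# Minimal sketch (assumes spaCy v2 and an installed es_core_news_md,
# e.g. via `python -m spacy download es_core_news_md`).
nlp = spacy.load("es_core_news_md")
doc = nlp("El paciente sufre de hipertensión.")

token = doc[4]            # "hipertensión"
print(token.vector[:5])   # static (FastText-derived) vector from the vocab table
print(doc.tensor[4][:5])  # context-sensitive tok2vec output for the same token
print(token.is_oov)       # True if the token has no entry in the static table
```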

If the above is correct, then I have some questions:

  • Does the NER component also use the static vectors?
    • If yes, then where does the tok2vec layer come into play?
    • If no, then is there any advantage to using the lg or md models if you only intend to use the model for, e.g., the NER component?
  • Is the tok2vec layer already trained for pretrained downloaded models, e.g. Spanish?
  • If I replace the NER component of a pretrained model, does it keep the tok2vec layer untouched, i.e. with the learned weights?
  • Is the tok2vec layer also trained when I train a NER model?
  • Would the pretrain command help the tok2vec layer learn some domain-specific words that may be OOV?

Thanks a lot!

Frasch answered 7/10, 2020 at 23:18

Comment: Some related discussion can be found here: https://mcmap.net/q/1777417/-proper-way-to-add-new-vectors-for-oov-words – Raving

Does the NER component also use the static vectors?

This is addressed in points 2 and 3 of my answer here.

Is the tok2vec layer already trained for pretrained downloaded models, e.g. Spanish?

Yes, the full model is trained, and the tok2vec layer is a part of it.
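
A quick way to convince yourself of this (assuming you have es_core_news_md installed): the downloaded pipeline predicts entities out of the box, which it could only do with trained tok2vec weights.

```python
import spacy

# The downloaded Spanish pipeline works out of the box, so the tok2vec
# weights feeding the NER ship fully trained.
nlp = spacy.load("es_core_news_md")
doc = nlp("Gabriel García Márquez nació en Aracataca, Colombia.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Expected output along the lines of:
# [('Gabriel García Márquez', 'PER'), ('Aracataca', 'LOC'), ('Colombia', 'LOC')]
```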

If I replace the NER component of a pretrained model, does it keep the tok2vec layer untouched, i.e. with the learned weights?

No, not in the current spaCy v2. The tok2vec layer is part of the model: if you remove the model, you also remove the tok2vec layer. In the upcoming v3, you'll be able to separate these, so you can in fact keep the tok2vec model separately and share it between components.
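
To give an idea of what that will look like, here is a sketch of the relevant excerpt of a v3 config, per the current v3 nightly docs (details may still change before the release): a standalone tok2vec component that the NER connects to through a listener.

```ini
# Excerpt of a v3 config with one shared tok2vec (sketch; syntax per the
# v3 nightly docs and subject to change).
[components.tok2vec]
factory = "tok2vec"

[components.ner]
factory = "ner"

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
# must match the output width of the shared tok2vec component
width = 96
upstream = "tok2vec"
```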

Is the tok2vec layer also trained when I train a NER model?

Yes, see above.
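
Concretely, in a typical v2 training loop every call to nlp.update backpropagates through the whole NER model, tok2vec layer included. A minimal sketch; the DISEASE label and the training example are made up:

```python
import random
import spacy

# Minimal v2-style training sketch: updating the NER also updates the
# tok2vec weights inside its model. Label and data are made up.
nlp = spacy.load("es_core_news_md")
ner = nlp.get_pipe("ner")
ner.add_label("DISEASE")

TRAIN_DATA = [
    ("Le diagnosticaron hipertensión.", {"entities": [(18, 30, "DISEASE")]}),
]

# Disable the other components so only the NER (and its tok2vec) is updated.
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()
    for _ in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)
        print(losses)
```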

Would the pretrain command help the tok2vec layer learn some domain-specific words that may be OOV?

See also my answer at https://mcmap.net/q/1777417/-proper-way-to-add-new-vectors-for-oov-words
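
In short, and as a rough sketch of the v2 mechanics (paths and filenames below are placeholders): spacy pretrain learns tok2vec weights from raw domain text by predicting the static vectors of an existing model, and spacy train can then be initialised from the resulting checkpoint with --init-tok2vec.

```bash
# Rough v2 sketch; paths and filenames are placeholders.
# 1) Pretrain a tok2vec layer on raw domain text (JSONL, one text per line),
#    using the static vectors of an existing model as the prediction target:
python -m spacy pretrain raw_domain_texts.jsonl es_core_news_md ./pretrain_out

# 2) Initialise the tok2vec weights from a pretrained checkpoint when training:
python -m spacy train es ./model_out train.json dev.json --pipeline ner \
    --init-tok2vec ./pretrain_out/model999.bin
```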

If you have further questions, I'm happy to discuss in the comments!

Raving answered 8/10, 2020 at 9:01
Comment: Thank you very much, Sofie, for the detailed answer. I'm still not quite sure I understand how the tok2vec handles OOV words. I'm insisting on this because I want to understand how much I need to worry about these words for my NER. E.g. there is no static vector for a misspelled word like "hipretension", but if I understand correctly, the tok2vec should learn to produce a vector for it when I train my NER (given that the word is in my data). I'm still not sure whether pretraining would help in this scenario. Would it make sense to augment the data with typos, to help predict misspellings as well? – Frasch
