What do tokens and vocab mean in GloVe embeddings?

I am using GloVe embeddings and I am quite confused about the tokens and vocab figures quoted for the embeddings, like this one:

Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)

What do tokens and vocab mean, respectively? What is the difference?

Hemicrania answered 6/9, 2016 at 14:13 Comment(0)

In NLP, tokens refers to the total number of "words" in your corpus, counting every occurrence. I put "words" in quotes because the definition varies by task. The vocab is the number of unique "words".

It should be the case that vocab <= tokens.
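To make the distinction concrete, here is a minimal sketch using a plain whitespace tokenizer (the real GloVe preprocessing is more involved; this only illustrates the two counts):

```python
# Toy corpus; split() is a stand-in for a real tokenizer.
corpus = "the cat sat on the mat the cat slept"

tokens = corpus.split()   # every occurrence counts
vocab = set(tokens)       # only unique words count

print(len(tokens))  # 9 -> total tokens in the corpus
print(len(vocab))   # 6 -> vocabulary size ('the' and 'cat' counted once)

assert len(vocab) <= len(tokens)
```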

Propylaeum answered 6/9, 2016 at 20:26 Comment(0)

Tokens are what you get after running a (possibly learned) tokenizer over your corpus, and a token is not necessarily the same thing as a word.

A word of 10 characters might be split into 2 or 3 tokens; how a word is broken up determines how well it can be represented and made meaningful to your model.
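To illustrate the subword idea (note that the pre-trained GloVe files linked in the question are word-level, so this applies to subword tokenizers such as BPE rather than to those files), here is a sketch with an invented subword vocabulary; real subword vocabularies are learned from data:

```python
# Illustrative only: a toy greedy longest-match subword tokenizer.
# The subword vocabulary below is made up for this example.
subword_vocab = {"token", "iza", "tion", "t", "o", "k", "e", "n", "i", "z", "a"}

def tokenize(word):
    """Greedily split a word into the longest known subwords."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in subword_vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word!r}")
    return pieces

print(tokenize("tokenization"))  # ['token', 'iza', 'tion'] -> 1 word, 3 tokens
```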

Kazbek answered 23/7, 2021 at 14:45 Comment(0)
