What is the preferred ratio between the vocabulary size and embedding dimension?

When using, for example, gensim's word2vec or a similar method to train your embedding vectors, I was wondering: is there a good or preferred ratio between the embedding dimension and the vocabulary size? Also, how does that change as more data comes along?

While I am on the topic, how would one choose a good window size when training embedding vectors?

I am asking this because I am not training my network on a real-life language dictionary; rather, the sentences describe relationships between processes, files, other processes, and so on. For example, a sentence in my text corpus would look like:

smss.exe irp_mj_create systemdrive windows system32 ntdll dll DesiredAccess: Execute/Traverse, Synchronize, Disposition: Open, Options: , Attributes: n/a, ShareMode: Read, AllocationSize: n/a, OpenResult: Opened

As you may imagine, the variations are numerous, but the question remains: how can I best tune these hyperparameters so that the embedding space does not overfit but still has enough meaningful features for each word?

Thanks,

Gabriel

Doradorado answered 27/1, 2018 at 19:50
The dimension of the pre-trained embeddings on the Google News dataset is only 300, even though the vocabulary size is extremely large. - Towne

Ratio is not what you're aiming for

I don't recall any specific papers on this problem, but the question feels a bit odd: in general, if I had a great model but wanted to switch to a vocabulary that is twice or ten times bigger, I would not change the embedding dimensions.

IMHO they're fairly orthogonal, unrelated parameters. The key factors for choosing the embedding dimension are the available computing resources (smaller is better, so if there's no difference in results and you can halve the dimensions, do so), the task, and, most importantly, the quantity of supervised training examples. The embedding dimension determines how much you compress, or intentionally bottleneck, the lexical information: a larger dimensionality lets your model distinguish finer lexical detail, which helps if and only if your supervised data carries enough signal to use that detail properly; if it does not, the extra lexical information will overfit and a smaller embedding dimension will generalize better. So a ratio between the vocabulary size and the embedding dimension is not (IMHO; I can't give evidence, it's just practical experience) something to look at, since the best embedding dimension is determined by where you use the embeddings, not by the data on which you train them.

In any case, this seems like a situation where your mileage will vary: theory and discussion are interesting, but your task and text domain are quite specific, findings from general NLP may or may not apply to your case, and it would be best to get empirical evidence for what works on your data. Train embeddings with sizes of 64/128/256 or 100/200/400 or whatever, train models using each of those, and compare the effects, as in the sketch below; that will take less effort (of people, not GPUs) than reasoning about what the effects should be.
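
A minimal sketch of such a sweep with gensim might look like this (assuming gensim 4.x, where the size argument is called vector_size; sentences and evaluate_downstream are placeholders for your tokenized corpus and for whatever supervised model actually consumes the embeddings):

from gensim.models import Word2Vec

# `sentences` is a placeholder for your tokenized corpus (a list of token lists)
# and `evaluate_downstream` is a placeholder for the supervised model that
# consumes the embeddings; neither is part of gensim.
for dim in (64, 128, 256):
    w2v = Word2Vec(
        sentences=sentences,
        vector_size=dim,   # the embedding dimension being compared
        window=5,          # context window; worth sweeping as well
        min_count=1,       # keep rare tokens; raise this to shrink the vocabulary
        sg=1,              # skip-gram
        workers=4,
    )
    score = evaluate_downstream(w2v.wv)  # compare dimensions by downstream performance
    print(dim, score)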

Consubstantiation answered 27/1, 2018 at 20:10
OK, maybe I phrased the question poorly. I believe I have seen places where people recommend the embedding dimension to be much smaller than the vocabulary size to avoid overfitting. Anyway, as you noticed, the problem I am looking to solve is a bit different from typical natural-language text but has similar properties. I do get pretty nice results so far, but I was simply wondering if and how better hyperparameter tuning would improve things in real-world examples. In my case these are the only hyperparameters I work with so far: window_size, emb_size, vocab, data corpus. - Doradorado
@GabrielBercea The effect size of embedding parameters doesn't tend to be large: a couple of percentage points of accuracy at most. That matters if you're tuning a system that already works in general and needs error reduction, but it is not that relevant for a proof-of-concept system. - Consubstantiation

This Google Developers blog post says:

Well, the following "formula" provides a general rule of thumb about the number of embedding dimensions:

embedding_dimensions =  number_of_categories**0.25

That is, the embedding vector dimension should be the 4th root of the number of categories.
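
For a rough sense of scale, here is what that rule suggests for a few vocabulary sizes (a throwaway check, not from the blog post):

# Fourth-root rule of thumb applied to a few vocabulary sizes
for vocab_size in (10_000, 100_000, 1_000_000):
    print(vocab_size, round(vocab_size ** 0.25))
# prints 10, 18 and 32 dimensions respectively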

Interestingly, the Word2vec Wikipedia article says (emphasis mine):

Nevertheless, for skip-gram models trained in medium size corpora, with 50 dimensions, a window size of 15 and 10 negative samples seems to be a good parameter setting.
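
Expressed as gensim 4.x arguments (a sketch; sentences stands in for your tokenized corpus), that setting would be roughly:

from gensim.models import Word2Vec

model = Word2Vec(
    sentences=sentences,  # placeholder for your tokenized corpus
    vector_size=50,       # 50 dimensions
    window=15,            # window size of 15
    negative=10,          # 10 negative samples
    sg=1,                 # skip-gram model
)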

Assuming a standard-ish sized vocabulary of 1.5 million words, this rule of thumb comes surprisingly close; the exponent that maps 1.5 million words down to the recommended 50 dimensions is about 0.2751, not far from the suggested 0.25:

50 ≈ 1.5e6 ** 0.2751
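
Worked out explicitly (a quick check of the arithmetic, not from either source):

import math

vocab_size = 1_500_000
# exponent that maps 1.5 million words down to 50 dimensions
print(math.log(50) / math.log(vocab_size))  # ~0.275, close to the suggested 0.25
# what the 0.25 rule itself would suggest
print(round(vocab_size ** 0.25))            # ~35 dimensions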

If answered 29/3, 2019 at 7:33
