When using, for example, gensim's word2vec or a similar method to train embedding vectors, I was wondering whether there is a good or preferred ratio between the embedding dimension and the vocabulary size. Also, how does that change as more data comes along?
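For concreteness, here is roughly how I set this up with gensim. Everything below is a placeholder sketch: `corpus_sentences` stands in for my tokenized logs, and the fourth-root sizing is just a rule of thumb I have seen floating around, not anything gensim prescribes:

```python
from gensim.models import Word2Vec

# Placeholder: corpus_sentences stands in for my tokenized event logs.
corpus_sentences = [
    ["smss.exe", "irp_mj_create", "systemdrive", "windows",
     "system32", "ntdll", "dll", "sharemode_read", "openresult_opened"],
    # ... many more event "sentences"
]

vocab_size = len({tok for sent in corpus_sentences for tok in sent})

# Rule of thumb I have seen (not a gensim requirement): dimension near the
# fourth root of the vocabulary size, scaled and clamped to a sane range.
dim = min(300, max(50, 4 * round(vocab_size ** 0.25)))

model = Word2Vec(corpus_sentences, vector_size=dim, window=5,
                 min_count=1, workers=4)
```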
While I am on the topic, how would one choose a good window size when training embedding vectors?
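To make the window question concrete, I have been probing it with something like the sketch below (reusing the placeholder `corpus_sentences` from above). My understanding is that smaller windows capture more local/syntactic structure and larger ones more topical similarity, but I am not sure how that carries over to my logs:

```python
from gensim.models import Word2Vec

# Probe: how do the nearest neighbours of one token change with the window?
for win in (2, 10):
    m = Word2Vec(corpus_sentences, vector_size=100, window=win,
                 min_count=1, workers=4, seed=42)
    print(win, m.wv.most_similar("smss.exe", topn=5))
```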
I am asking because I am not training my network on a real-life language corpus; rather, the sentences describe relationships between processes, files, other processes, and so on. For example, a sentence in my text corpus looks like:
smss.exe irp_mj_create systemdrive windows system32 ntdll dll DesiredAccess: Execute/Traverse, Synchronize, Disposition: Open, Options: , Attributes: n/a, ShareMode: Read, AllocationSize: n/a, OpenResult: Opened
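For what it's worth, here is roughly how I currently turn one such event into a "sentence" (a sketch; the regex is my own choice, not a standard):

```python
import re

line = ("smss.exe irp_mj_create systemdrive windows system32 ntdll dll "
        "DesiredAccess: Execute/Traverse, Synchronize, Disposition: Open, "
        "Options: , Attributes: n/a, ShareMode: Read, AllocationSize: n/a, "
        "OpenResult: Opened")

# Lowercase and keep runs of letters/digits with '.', '_' and '/' inside
# tokens, so smss.exe, irp_mj_create and execute/traverse survive intact.
tokens = re.findall(r"[a-z0-9_.]+(?:/[a-z0-9_.]+)*", line.lower())
print(tokens)
```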
As you may imagine, the variations are numerous, but the question remains: how can I best tune these hyperparameters so that the embedding space does not overfit but still has enough meaningful features for each word?
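The best idea I have had so far is a small grid search scored against a hand-made list of token pairs I expect to be related; `probe_pairs` below is hypothetical and the whole thing is only a sketch (again reusing `corpus_sentences` from above), so I would be glad to hear of a more principled approach:

```python
from itertools import product

from gensim.models import Word2Vec

# Hypothetical probe pairs: tokens I expect to land near each other.
probe_pairs = [("irp_mj_create", "openresult_opened"), ("ntdll", "dll")]

best = None
for dim, win in product((50, 100, 200), (2, 5, 10)):
    m = Word2Vec(corpus_sentences, vector_size=dim, window=win,
                 min_count=1, workers=4, seed=42)
    score = sum(m.wv.similarity(a, b) for a, b in probe_pairs
                if a in m.wv and b in m.wv)
    if best is None or score > best[0]:
        best = (score, dim, win)

print("best (score, dim, window):", best)
```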
Thanks,
Gabriel