Is it a good idea to use word2vec for encoding of categorical features?

Asked 13/11, 2019 at 10:7 Answered 13/11, 2019 at 19:6

machine-learning nlp word2vec categorical-data feature-engineering

I am facing a binary prediction task and have a set of features of which all are categorical. A key challenge is therefore to encode those categorical features to numbers and I was looking for smart ways to do so. I stumbled over word2vec, which is mostly used for NLP, but I was wondering whether I could use it to encode my variables, i.e. simply take the weights of the neural net as the encoded features.

However, I am not sure, whether it is a good idea since, the context words, which serve as the input features in word2vec are in my case more or less random, in contrast to real sentences which word2vec was originially made for.

Do you guys have any advice, thoughts, recommendations on this?

Unshakable answered 13/11, 2019 at 10:7 Comment(0)

You should look into entity embedding if you are searching for a way to utilize embeddings for categorical variables.

google has a good crash course on the topic: https://developers.google.com/machine-learning/crash-course/embeddings/categorical-input-data
this is a good paper on arxiv written by a team from a Kaggle competition: https://arxiv.org/abs/1604.06737

Milky answered 13/11, 2019 at 10:13 Comment(0)

It's certainly possible to use the word2vec algorithm to train up 'dense embeddings' for things like keywords, tags, categories, and so forth. It's been done, sometimes beneficially.

Whether it's a good idea in your case will depend on your data & goals – the only way to know for sure is to try it, and evaluate the results versus your alternatives. (For example, if the number of categories is modest from a controlled vocabulary, one-hot encoding of the categories may be practical, and depending on the kind of binary classifier you use downstream, the classifier may itself be able to learn the same sorts of subtle interrelationships between categories that could also otherwise be learned via a word2vec model. On the other hand, if categories are very numerous & chaotic, the pre-step of 'compressing' them into a smaller-dimensional space, where similar categories have similar representational vectors, may be more helpful.)

That such tokens don't quite have the same frequency distributions & surrounding contexts as true natural language text may mean it's worth trying a wider range of non-default training options on any word2vec model.

In particular, if your categories don't have a natural ordering giving rise to meaningful near-neighbors relationships, using a giant window (so all words in a single 'text' are in each others' contexts) may be worth considering.

Recent versions of the Python gensim Word2Vec allow changing a parameter named ns_exponent – which was fixed at 0.75 in many early implementations, but at least one paper has suggested can usefully vary far from that value for certain corpus data and recommendation-like applications.

Fendley answered 13/11, 2019 at 19:6 Comment(1)

thanks for your reply. Do you by chance recall a paper which uses word2vec as encoding strategy? – Unshakable 16/11, 2019 at 12:22

Recommended topics

Hot tags