What's the major difference between glove and word2vec?

What is the difference between word2vec and GloVe? Are both of them ways to train a word embedding? If so, how can we use both?

Spinet answered 10/5, 2019 at 6:10 Comment(0)

Yes, they're both ways to train a word embedding. They both provide the same core output: one vector per word, with the vectors in a useful arrangement. That is, the vectors' relative distances/directions roughly correspond with human ideas of overall word relatedness, and even relatedness along certain salient semantic dimensions.

Word2Vec does incremental, 'sparse' training of a neural network, by repeatedly iterating over a training corpus.

GloVe works to fit vectors to model a giant word co-occurrence matrix built from the corpus.

Given the same corpus, word-vectors of the same dimensionality, and the same attention to meta-optimizations, the quality of their resulting word-vectors will be roughly similar. (When I've seen someone confidently claim one or the other is definitely better, they've often compared some tweaked/best-case use of one algorithm against rough/arbitrary defaults for the other.)

I'm more familiar with Word2Vec, and my impression is that Word2Vec's training better scales to larger vocabularies, and has more tweakable settings that, if you have the time, might allow tuning your own trained word-vectors more to your specific application. (For example, using a small-versus-large window parameter can have a strong effect on whether a word's nearest-neighbors are 'drop-in replacement words' or more generally words-used-in-the-same-topics. Different downstream applications may prefer word-vectors that skew one way or the other.)
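
To make the window point concrete, here is a minimal sketch (assuming the gensim library and a made-up toy corpus, neither of which is part of the original answer) that trains Word2Vec twice, once with a narrow and once with a wide window; on a real corpus the narrow model's nearest neighbours skew towards drop-in replacement words, the wide model's towards topical relatives:

    from gensim.models import Word2Vec

    # Hypothetical toy corpus; in practice you would use millions of sentences.
    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
        ["cats", "and", "dogs", "are", "common", "pets"],
    ]

    # Small window: nearest neighbours tend to be syntactic, drop-in replacements.
    narrow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

    # Large window: nearest neighbours tend to be topically related words.
    wide = Word2Vec(sentences, vector_size=50, window=10, min_count=1, sg=1, epochs=50)

    print(narrow.wv.most_similar("cat", topn=3))
    print(wide.wv.most_similar("cat", topn=3))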

Conversely, some proponents of GloVe tout that it does fairly well without needing metaparameter optimization.

You probably wouldn't use both, unless comparing them against each other, because they play the same role for any downstream applications of word-vectors.

Intone answered 10/5, 2019 at 6:50 Comment(0)

Word2vec is a predictive model: it trains by trying to predict a target word given its context (the CBOW method) or the context words given the target (the skip-gram method). It uses trainable embedding weights to map words to their corresponding embeddings, which are used to help the model make predictions. The loss function for training the model is related to how good the model's predictions are, so as the model trains to make better predictions it will result in better embeddings.
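
As a rough sketch of what "predictive" means here (the toy sentence and helper name below are mine, not part of word2vec itself), these are the (target, context) pairs the skip-gram variant is trained to predict; CBOW simply reverses the direction and predicts the target from the whole window of context words:

    def skipgram_pairs(tokens, window=2):
        """Yield (target, context) pairs; the network is trained to
        predict each context word from the target word's embedding."""
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield target, tokens[j]

    tokens = "the quick brown fox jumps over the lazy dog".split()
    for pair in skipgram_pairs(tokens, window=2):
        print(pair)   # ('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...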

GloVe is based on matrix-factorization techniques applied to the word-context matrix. It first constructs a large matrix of (words x contexts) co-occurrence information, i.e., for each “word” (the rows), you count how frequently it appears in some “context” (the columns) in a large corpus; those counts are the matrix values. The number of “contexts” can be very large, since it is essentially combinatorial in size. We then factorize this matrix to yield a lower-dimensional (words x features) matrix, where each row is a vector representation for a word. In general, this is done by minimizing a “reconstruction loss”, which tries to find the lower-dimensional representation that explains most of the variance in the high-dimensional data.
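
For intuition only, here is a tiny count-then-factorize illustration in numpy (plain truncated SVD on raw counts, which is closer to LSA than to GloVe's actual weighted least-squares fit; the corpus and window size are made up):

    import numpy as np

    corpus = [
        "ice is a solid".split(),
        "steam is a gas".split(),
        "ice and steam are forms of water".split(),
    ]

    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}

    # Build the (word x context) co-occurrence matrix with a symmetric window of 2.
    X = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - 2), min(len(sent), i + 3)):
                if j != i:
                    X[idx[w], idx[sent[j]]] += 1

    # Factorize and keep the top-k dimensions as dense word vectors.
    U, S, _ = np.linalg.svd(X)
    k = 2
    word_vectors = U[:, :k] * S[:k]      # one k-dimensional vector per word
    print(dict(zip(vocab, word_vectors.round(2))))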

Before GloVe, word-representation algorithms could be divided into two main streams: the statistics-based (e.g., LSA) and the learning-based (e.g., Word2Vec). LSA produces low-dimensional word vectors by singular value decomposition (SVD) of the co-occurrence matrix, while Word2Vec employs a shallow neural network on a center-context word-pair prediction task, where the word vectors are just a by-product.

The most striking property of Word2Vec is that similar words are located close together in the vector space, and arithmetic operations on word vectors can capture semantic or syntactic relationships, e.g., “king” - “man” + “woman” -> “queen” or “better” - “good” + “bad” -> “worse”. However, LSA does not preserve such linear relationships in the vector space.
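
With pretrained vectors this analogy arithmetic is a one-liner; the sketch below assumes gensim's downloader and one of its pretrained vector sets (any reasonably large word2vec or GloVe set exhibits the effect):

    import gensim.downloader as api

    wv = api.load("glove-wiki-gigaword-100")   # pretrained 100-d vectors; downloaded on first use
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # typically returns [('queen', ...)]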

The motivation of GloVe is to force the model to learn such linear relationships explicitly, based on the co-occurrence matrix. Essentially, GloVe is a log-bilinear model with a weighted least-squares objective. In this sense it is a hybrid method that applies machine learning to the co-occurrence statistics matrix, and this is the general difference between GloVe and Word2Vec.
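
For reference, the weighted least-squares objective from the GloVe paper is

    J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where X_ij is the co-occurrence count of words i and j, w_i and w̃_j are the word and context vectors, the b terms are biases, and f is a weighting function that down-weights both very rare and very frequent co-occurrences.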

If we dive into the derivation of GloVe's equations, we find that the difference lies in the intuition. GloVe observes that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Take the example from Stanford NLP (Global Vectors for Word Representation) and consider the co-occurrence probabilities for the target words ice and steam with various probe words from the vocabulary:

  • As one might expect, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid.
  • Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
  • Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
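
A toy calculation (with made-up counts, not the real corpus statistics from the GloVe paper) shows why the ratio is the discriminative quantity:

    cooc = {                     # hypothetical co-occurrence counts with probe words
        "ice":   {"solid": 80, "gas": 2,  "water": 300, "fashion": 1},
        "steam": {"solid": 3,  "gas": 70, "water": 280, "fashion": 1},
    }

    def p(word, probe):
        """P(probe | word): how often the probe appears in the word's contexts."""
        return cooc[word][probe] / sum(cooc[word].values())

    for probe in ["solid", "gas", "water", "fashion"]:
        ratio = p("ice", probe) / p("steam", probe)
        print(f"P({probe}|ice) / P({probe}|steam) = {ratio:.2f}")
    # large ratio -> ice-specific (solid), small ratio -> steam-specific (gas),
    # near 1 -> non-discriminative (water, fashion)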

Word2Vec, however, works on the pure co-occurrence probabilities, maximizing the probability that the words surrounding the target word appear as its context.

In practice, to speed up training, Word2Vec employs negative sampling, which substitutes the softmax function with sigmoid functions operating on real (co-occurring) and noise (non-co-occurring) word pairs. This implicitly results in the clustering of words into a cone in the vector space, while GloVe's word vectors are located more discretely.
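
A numpy-only sketch (with random stand-in vectors, nothing trained) of the negative-sampling objective being described: the real (target, context) pair is pushed towards a sigmoid score of 1 and a handful of sampled noise pairs towards 0, instead of normalizing over the whole vocabulary with a softmax:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    dim = 50
    target_vec  = rng.normal(size=dim)        # vector of the centre word
    context_vec = rng.normal(size=dim)        # vector of a true context word
    noise_vecs  = rng.normal(size=(5, dim))   # vectors of 5 sampled noise words

    # Negative-sampling loss: -log sigma(t.c) for the real pair,
    # plus -log sigma(-t.n) summed over the noise pairs.
    loss = -np.log(sigmoid(target_vec @ context_vec)) \
           - np.sum(np.log(sigmoid(-noise_vecs @ target_vec)))
    print(f"negative-sampling loss for this random example: {loss:.3f}")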

Effects answered 29/9, 2020 at 12:43 Comment(2)
Awesome explanation. @Effects Could you elaborate on the last paragraph of your post? What does it mean that word2vec uses negative sampling? Why does it result in the clustering of words into a cone in the vector space? How does it apply to real data? Do you mean word2vec is much faster than GloVe? Is there any condition or assumption for it to happen? – Click
-ve sampling is training with non-co-occurring word pairs (by picking a random word from the vocab) and thus training the model to keep those word pairs separate, in addition to training it to keep co-occurring pairs close (in the vector space). Then, rather than requiring your model to output a vector of probabilities of each word in your vocabulary co-occurring with x (softmax), you ask your model to give the probability that a word pair is co-occurring or not (binary classification using a sigmoid). Does this cause clustering because you have more -ve samples than +ve? – Catholicity
