I'm reading the paper below and I have some trouble understanding the concept of negative sampling.
http://arxiv.org/pdf/1402.3722v1.pdf
Can anyone help, please?
The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have
v_c . v_w
-------------------
sum_i(v_ci . v_w)
The numerator is basically the similarity between the word c (the context) and the word w (the target). The denominator computes the similarity of all other contexts ci and the target word w. Maximising this ratio ensures that words which appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts ci. Negative sampling is one of the ways of addressing this problem: just select a couple of contexts ci at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measured by their dot product) than the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), instead of all other words in the language. This makes word2vec much, much faster to train.
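To make this concrete, here is a minimal NumPy sketch of the quantity that negative sampling optimises for one (target, context) pair; the toy vocabulary, the random vectors and the name sgns_loss are mine, not from the paper:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy vocabulary with randomly initialised "input" (target) and "output" (context) vectors.
    vocab = ["cat", "food", "democracy", "greed", "Freddy"]
    dim = 8
    target_vecs = {w: rng.normal(size=dim) for w in vocab}
    context_vecs = {w: rng.normal(size=dim) for w in vocab}

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_loss(target, pos_context, neg_contexts):
        """Negative-sampling loss for one (target, context) pair: push the dot
        product with the observed context up, and with the sampled negatives down."""
        v_w = target_vecs[target]
        pos_term = np.log(sigmoid(np.dot(context_vecs[pos_context], v_w)))
        neg_terms = sum(np.log(sigmoid(-np.dot(context_vecs[c], v_w))) for c in neg_contexts)
        return -(pos_term + neg_terms)

    # "cat" observed in the context of "food"; three random words act as negatives.
    print(sgns_loss("cat", "food", ["democracy", "greed", "Freddy"]))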
In word2vec, for any given word you have a list of words that need to be similar to it (the positive class), but the negative class (words which are not similar to the target word) is compiled by sampling. – Galba

Computing the softmax (the function that determines which words are similar to the current target word) is expensive, since it requires summing over all words in V (the denominator), which is generally very large.
What can be done?
Different strategies have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact but modify its architecture to improve its efficiency (e.g. hierarchical softmax). Sampling-based approaches, on the other hand, completely do away with the softmax layer and instead optimise some other loss function that approximates the softmax (they do this by approximating the normalization in the denominator of the softmax with some other loss that is cheap to compute, such as negative sampling).
The loss function in word2vec is based on the softmax, which the logarithm decomposes into a dot-product term and a normalisation term; with some math and the gradient formulation (see more details at 6) it is converted to a negative-sampling loss, sketched below.
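The equation images from the original answer are not reproduced here; roughly, in the plain-text notation used earlier (v_w is the vector of the target word, u_c the output vector of the context word, sigma the sigmoid), the softmax loss for one (target, context) pair is

    J = -log( exp(u_c . v_w) / sum_{i in V} exp(u_i . v_w) )

which the logarithm decomposes into

    J = -(u_c . v_w) + log( sum_{i in V} exp(u_i . v_w) )

and negative sampling replaces with the binary classification loss

    J = -log( sigma(u_c . v_w) ) - sum_{k=1..K} log( sigma(-u_k . v_w) )

where u_k are the output vectors of K randomly drawn negative words.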
As you can see, it is converted to a binary classification task (y=1 for the positive class, y=0 for the negative class). As we need labels to perform this binary classification task, we designate all context words c as true labels (y=1, positive samples), and k words randomly selected from the corpus as false labels (y=0, negative samples).
Look at the following example paragraph. Assume our target word is "Word2vec". With a window of 3, our context words are: The, widely, popular, algorithm, was, developed. These context words are considered positive labels. We also need some negative labels. We randomly pick some words from the corpus (produce, software, Collobert, margin-based, probabilistic) and consider them as negative samples. This technique of picking random examples from the corpus is called negative sampling.
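As a rough sketch of how those negatives could be drawn in code (the answer just says "randomly"; word2vec itself samples from the unigram distribution raised to the 3/4 power, and the function name draw_negatives and the toy counts below are mine):

    import numpy as np

    rng = np.random.default_rng(0)

    def draw_negatives(word_counts, exclude, k=5, power=0.75):
        """Sample k negative words from the unigram distribution raised to `power`
        (0.75 is the value used in the original word2vec paper), skipping the
        words in `exclude` (the target word and its context words)."""
        words = [w for w in word_counts if w not in exclude]
        probs = np.array([word_counts[w] ** power for w in words], dtype=float)
        probs /= probs.sum()
        return list(rng.choice(words, size=k, replace=False, p=probs))

    counts = {"the": 50, "produce": 3, "software": 7, "Collobert": 1,
              "margin-based": 1, "probabilistic": 2, "word2vec": 5, "algorithm": 4}
    print(draw_negatives(counts, exclude={"word2vec", "the", "algorithm"}, k=3))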
Reference: I wrote a tutorial article about negative sampling here.
Why do we use negative sampling? -> to reduce computational cost
The cost function for vanilla Skip-Gram (SG) and Skip-Gram negative sampling (SGNS) looks like this:
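The answer shows these cost functions as images; a rough plain-text reconstruction of the standard objectives (m is the context window size, sigma the sigmoid, c_pos and c_neg_k the output vectors of the positive context word and of the K sampled negatives) is:

    J_SG   = -(1/T) * sum_{t=1..T} sum_{-m <= j <= m, j != 0} log p(w_{t+j} | w_t)

    J_SGNS = -(1/T) * sum_{t=1..T} sum_{-m <= j <= m, j != 0}
                 [ log sigma(c_pos . w_t) + sum_{k=1..K} log sigma(-c_neg_k . w_t) ]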
Note that T is the number of training words (the length of the training corpus), while V is the size of the vocabulary.
The probability distribution p(w_t+j|w_t) in SG is computed for all V words in the vocabulary with:
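The equation image is missing from this copy; the softmax it refers to is, roughly (c_i are the output/context vectors, w_t the input vector of the center word):

    p(w_t+j | w_t) = exp(c_{w_t+j} . w_t) / sum_{i=1..V} exp(c_i . w_t)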
V can easily exceed tens of thousands when training a Skip-Gram model. The probability needs to be computed V times, making it computationally expensive. Furthermore, the normalization factor in the denominator requires an extra V computations.
On the other hand, the probability distribution in SGNS is computed with:
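Again the image is missing; the probability it refers to is, roughly:

    p(D=1 | w_t, c_pos) = sigma(c_pos . w_t) = 1 / (1 + exp(-c_pos . w_t))
    p(D=0 | w_t, c_neg) = sigma(-c_neg . w_t)   for each of the K rows c_neg in W_neg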
c_pos is the word vector of the positive word, and W_neg holds the word vectors of all K negative samples in the output weight matrix. With SGNS, the probability needs to be computed only K + 1 times, where K is typically between 5 and 20. Furthermore, no extra iterations are necessary to compute the normalization factor in the denominator.
With SGNS, only a fraction of weights are updated for each training sample, whereas SG updates all millions of weights for each training sample.
How does SGNS achieve this? -> by transforming a multi-class classification task into a binary classification task.
With SGNS, word vectors are no longer learned by predicting context words of a center word. It learns to differentiate the actual context words (positive) from randomly drawn words (negative) from the noise distribution.
In real life, you don't usually observe regression with random words like Gangnam-Style or pimples. The idea is that if the model can distinguish between the likely (positive) pairs vs unlikely (negative) pairs, good word vectors will be learned.
In the above figure, the current positive word-context pair is (drilling, engineer). K=5 negative samples are randomly drawn from the noise distribution: minimized, primary, concerns, led, page. As the model iterates through the training samples, the weights are optimized so that the probability for the positive pair outputs p(D=1|w,c_pos) ≈ 1, and the probability for the negative pairs outputs p(D=1|w,c_neg) ≈ 0.
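To illustrate the point above that only a fraction of the weights are updated per training sample, here is a rough NumPy sketch of a single SGNS update; the function name sgns_step, the matrices W_in and W_out, and the toy indices are mine, not from the answer:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_step(W_in, W_out, center, pos, negs, lr=0.025):
        """One SGNS update. Only the center row of W_in and the K+1 rows of
        W_out for the positive and negative words are touched; every other
        weight is left alone (unlike a full-softmax skip-gram update)."""
        v = W_in[center]
        grad_v = np.zeros_like(v)
        for c, label in [(pos, 1.0)] + [(n, 0.0) for n in negs]:
            score = sigmoid(W_out[c] @ v)   # p(D=1 | w, c)
            g = score - label               # gradient of the binary log loss
            grad_v += g * W_out[c]
            W_out[c] -= lr * g * v          # update one output row
        W_in[center] -= lr * grad_v         # update one input row

    V, dim, K = 10000, 100, 5
    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.1, size=(V, dim))
    W_out = rng.normal(scale=0.1, size=(V, dim))
    sgns_step(W_in, W_out, center=42, pos=7, negs=[91, 250, 3333, 808, 5])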
If we set K as V - 1, then negative sampling is the same as the vanilla skip-gram model. Is my understanding correct? – Dropsical