Gradient of a Loss Function for an SVM

I'm working through this class on convolutional neural networks. I've been trying to implement the gradient of a loss function for an SVM, and although I have a copy of the solution, I'm having trouble understanding why the solution is correct.

On this page the notes define the gradient of the loss function as follows:

[Image: gradient of the multiclass SVM loss, from the CS231n course notes]

My analytic gradient matches the numeric one when it is implemented in code as follows:

  dW = np.zeros(W.shape)  # initialize the gradient as zero

  # compute the loss and the gradient
  num_classes = W.shape[1]
  num_train = X.shape[0]
  loss = 0.0
  for i in xrange(num_train):
    scores = X[i].dot(W)
    correct_class_score = scores[y[i]]
    for j in xrange(num_classes):
      if j == y[i]:
        continue  # nothing is accumulated for the correct class here
      margin = scores[j] - correct_class_score + 1  # note delta = 1
      if margin > 0:
        dW[:, y[i]] += -X[i]  # correct class: accumulate -X[i] once per violating class
        dW[:, j] += X[i]      # gradient update for incorrect rows
        loss += margin
However, it seems from the notes that dW[:, y[i]] should be changed every time j == y[i], since we subtract the loss whenever j == y[i]. I'm very confused about why the code is not:

  dW = np.zeros(W.shape)  # initialize the gradient as zero

  # compute the loss and the gradient
  num_classes = W.shape[1]
  num_train = X.shape[0]
  loss = 0.0
  for i in xrange(num_train):
    scores = X[i].dot(W)
    correct_class_score = scores[y[i]]
    for j in xrange(num_classes):
      if j == y[i]:
        dW[:, y[i]] += -X[i]  # update the correct class's column whenever j == y[i]
        continue
      margin = scores[j] - correct_class_score + 1  # note delta = 1
      if margin > 0:
        dW[:, j] += X[i]  # gradient update for incorrect rows
        loss += margin

and the gradient of the correct class would change when j == y[i]. Why are both updates being computed only when j != y[i]?
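
In case it helps, the numeric check I am comparing against looks roughly like this; the function name svm_loss_naive, the random shapes, and the sampled coordinates are just for illustration, with the function wrapping the loop from the solution above:

  import numpy as np

  def svm_loss_naive(W, X, y):
    # Multiclass SVM loss and gradient, same loop as in the solution above.
    dW = np.zeros(W.shape)
    num_classes = W.shape[1]
    num_train = X.shape[0]
    loss = 0.0
    for i in range(num_train):
      scores = X[i].dot(W)
      correct_class_score = scores[y[i]]
      for j in range(num_classes):
        if j == y[i]:
          continue
        margin = scores[j] - correct_class_score + 1  # delta = 1
        if margin > 0:
          dW[:, y[i]] -= X[i]
          dW[:, j] += X[i]
          loss += margin
    return loss, dW

  # Spot-check the analytic gradient against a centered-difference estimate.
  np.random.seed(0)
  X = np.random.randn(10, 5)            # 10 examples, 5 features
  y = np.random.randint(0, 3, size=10)  # labels for 3 classes
  W = np.random.randn(5, 3) * 0.01      # small random weights

  _, dW = svm_loss_naive(W, X, y)
  h = 1e-5
  for _ in range(5):
    ix = (np.random.randint(W.shape[0]), np.random.randint(W.shape[1]))
    W[ix] += h
    loss_plus, _ = svm_loss_naive(W, X, y)
    W[ix] -= 2 * h
    loss_minus, _ = svm_loss_naive(W, X, y)
    W[ix] += h  # restore W
    numeric = (loss_plus - loss_minus) / (2 * h)
    print(ix, numeric, dW[ix])  # the two values should agree closely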

Loriannlorianna asked on 26/7/2016 at 4:52

Comments (7):

Ectomere: Regarding "it seems from the notes that dW[:, y[i]] should be changed every time j == y[i] since we subtract the loss whenever j == y[i]": isn't the summation symbol summing over j NOT equal to y[i]?

Loriannlorianna: Looking at it now, that does seem to be the case. What's throwing me off is the wording "For the other rows where j != y_i the gradient is...". It sounds like the first expression covers the case where j == y_i. What is the correct reading here? Also (maybe related), why is there a sum in the first expression but not in the second?

Ectomere: These are gradients with respect to different variables. The first one is the gradient with respect to w_{y_i} (note that on the left it is grad_{w_{y_i}}), and its expression involves a sum over all j not equal to y_i; the second one is the gradient with respect to each w_j with j not equal to y_i.

Loriannlorianna: Ah, now I see. Why is there only a summation in the first and not in the second, though? In the code they are executed the same number of times, using the same comparison...

Ectomere: I didn't look at the definitions of your L_i and w_i or at your context, so I am not sure. But no, the code is fine. You are doing an inner loop over j, and (1) you add to dW[:, y[i]] for each j not equal to y[i]; (2) you add to dW[:, j] for each j not equal to y[i]. In step 2 you are adding to a different index of the array for each j, so no, there is no summation there.

Loriannlorianna: Got it, thanks. If you'd like, you can copy your comments into an answer and I'd be happy to upvote and accept it.

Sinecure: Hi David, can you explain what dW[:, y[i]] += -X[i] and dW[:, j] += X[i] do? I still feel confused.

I don't have enough reputation to comment, so I am answering here. Whenever you compute the loss for x[i], the i-th training example, and get some nonzero loss, that means you should move the weight vector of each incorrect class (j != y[i]) that violates the margin away from x[i], and at the same time move the weights (the hyperplane) of the correct class (j == y[i]) closer to x[i]. By the parallelogram law, w + x lies between w and x, so W[:, y[i]] moves a little nearer to x[i] each time it finds loss > 0.

Thus dW[:, y[i]] += -X[i] and dW[:, j] += X[i] are accumulated in the loop, but when we update the weights we step in the direction of decreasing loss (we subtract the gradient), so we are effectively adding X[i] to the correct class's weights and moving the weights that misclassify away from X[i].
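
For example, a tiny sketch of that update step (the single example, the shapes, and learning_rate are made up here for illustration; they are not part of the assignment code):

  import numpy as np

  # One example with 4 features and 3 classes (illustrative shapes only).
  np.random.seed(0)
  X = np.random.randn(1, 4)
  y = np.array([2])
  W = np.random.randn(4, 3) * 0.01
  learning_rate = 1e-2

  # Accumulate the gradient exactly as in the loop from the question.
  dW = np.zeros_like(W)
  scores = X[0].dot(W)
  for j in range(W.shape[1]):
    if j == y[0]:
      continue
    margin = scores[j] - scores[y[0]] + 1
    if margin > 0:
      dW[:, y[0]] -= X[0]  # correct class accumulates -X[0] per violating class
      dW[:, j] += X[0]     # each violating class accumulates +X[0]

  # The update subtracts the gradient, so W[:, y[0]] moves toward X[0]
  # and each violating column moves away from X[0].
  W -= learning_rate * dW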

Browse answered on 25/5/2017 at 12:20
