Simple vanilla RNN doesn't pass gradient check

I recently tried to implement a vanilla RNN from scratch. I implemented everything and even ran a seemingly OK example, yet I noticed the gradient check does not pass: only some parts (specifically the weight and bias for the output) pass it, while the other weights (Whh, Whx) do not.

I followed karpathy/Coursera's implementation and made sure everything was implemented the same way. Yet karpathy/Coursera's code passes the gradient check and mine doesn't. At this point I have no clue what is causing this.
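
For reference, by "gradient check" I mean the usual comparison of each analytical gradient against a centered finite-difference estimate, parameter by parameter. The sketch below only shows the kind of check I am running; loss_fn and the parameter dictionary are placeholders, not my exact code:

import numpy as np

def gradient_check(loss_fn, params, analytic_grads, eps=1e-5, threshold=1e-7):
    """Compare analytic gradients against centered finite differences, per parameter.

    loss_fn:        a zero-argument closure that re-runs the forward pass and
                    returns the scalar loss using the *current* values in params
    params:         dict name -> np.ndarray (modified in place, then restored)
    analytic_grads: dict name -> np.ndarray coming from the backward pass
    """
    for name, p in params.items():
        num_grad = np.zeros_like(p)
        it = np.nditer(p, flags=['multi_index'])
        while not it.finished:
            idx = it.multi_index
            old = p[idx]
            p[idx] = old + eps
            loss_plus = loss_fn()
            p[idx] = old - eps
            loss_minus = loss_fn()
            p[idx] = old  # restore the original value
            num_grad[idx] = (loss_plus - loss_minus) / (2 * eps)
            it.iternext()
        a, n = analytic_grads[name], num_grad
        rel_err = np.linalg.norm(a - n) / (np.linalg.norm(a) + np.linalg.norm(n) + 1e-12)
        print(f"{name}: relative error {rel_err:.2e} "
              f"({'OK' if rel_err < threshold else 'FAIL'})")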

Here are the snippets responsible for the backward pass in the original code:

def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    
    gradients['dWya'] += np.dot(dy, a.T)
    gradients['dby'] += dy
    da = np.dot(parameters['Wya'].T, dy) + gradients['da_next'] # backprop into h
    daraw = (1 - a * a) * da # backprop through tanh nonlinearity
    gradients['db'] += daraw
    gradients['dWax'] += np.dot(daraw, x.T)
    gradients['dWaa'] += np.dot(daraw, a_prev.T)
    gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)
    return gradients
    
def rnn_backward(X, Y, parameters, cache):
    # Initialize gradients as an empty dictionary
    gradients = {}
    
    # Retrieve from cache and parameters
    (y_hat, a, x) = cache
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    
    # each one should be initialized to zeros of the same dimension as its corresponding parameter
    gradients['dWax'], gradients['dWaa'], gradients['dWya'] = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
    gradients['db'], gradients['dby'] = np.zeros_like(b), np.zeros_like(by)
    gradients['da_next'] = np.zeros_like(a[0])
    
    ### START CODE HERE ###
    # Backpropagate through time
    for t in reversed(range(len(X))):
        dy = np.copy(y_hat[t])
        # softmax + cross-entropy gradient: subtract 1 from the predicted probability at the correct class index Y[t]
        dy[Y[t]] -= 1
        gradients = rnn_step_backward(dy, gradients, parameters, x[t], a[t], a[t-1])
    ### END CODE HERE ###
    
    return gradients, a

and this is my implementation:

def rnn_cell_backward(self, xt, h, h_prev, output, true_label, dh_next):
    """
        Runs a single backward pass once.
        Inputs:
        - xt: The input data of shape (Batch_size, input_dim_size)
        - h: The hidden state at timestep t (which comes from the forward pass)
        - h_prev: The previous hidden state at timestep t-1
        - output : The output at the current timestep
        - true_label: The label for the current timestep, used for calculating loss
        - dh_next: The gradient of the hidden state (dh) flowing in from the next
            timestep. It is zero for the last timestep and is updated as we move
            backward through the backpropagation loop: the dh_next for the next
            iteration comes from the dh_prev returned here, since the backward pass
            is essentially a loop that starts at the last timestep and traverses
            back to the beginning.

        Returns : 
        - dW1 : The gradient for W1
        - dW2 : The gradient for W2
        - dW3 : The gradient for W3
        - dbh : The gradient for bh
        - dbo : The gradient for bo
        - dh_prev : The gradient for the previous hidden state at timestep t-1. This will
          be used as dh_next in the next round of backpropagation.
        - per_ts_loss  : The loss for current timestep.
    """
    e = np.copy(output)
    # correct idx for each row(sample)!
    idxs = np.argmax(true_label, axis=1)
    # number of rows(samples) in our batch
    rows = np.arange(e.shape[0])
    # This is the vectorized version of error_t = output_t - label_t or simply e = output[t] - 1
    # where t refers to the index in which label is 1. 
    e[rows, idxs] -= 1
    # This is used for our loss to see how well we are doing during training.
    per_ts_loss = output[rows, idxs].sum()

    # must have shape of W3 which is (vocabsize_or_output_dim_size, hidden_state_size)
    dW3 = np.dot(e.T, h)
    # dbo: gradient of the output bias; sum the error e over the batch dimension
    dbo = np.sum(e, axis=0)
    # when calculating the dh, we also add the dh from the next timestep as well
    # when we are in the last timestep, the dh_next is initially zero.
    dh = np.dot(e,  self.W3) + dh_next  # from later cell
    # the input part
    dtanh = (1 - h * h) * dh
    # dbh: gradient of the hidden bias; sum dtanh over the batch dimension
    dbh = np.sum(dtanh, axis=0)

    # gradient of the loss with respect to the input xt
    # this is actually not needed: we only care about the tunable
    # parameters, so we are only after W1, W2, W3, dbh and dbo
    # dxt = np.dot(dtanh, self.W1.T)

    # must have the shape of (vocab_size, hidden_state_size)
    dW1 = np.dot(xt.T, dtanh)

    # compute the gradient with respect to the previous hidden state
    # (this becomes dh_next for the next iteration of the backward loop)
    dh_prev = np.dot(dtanh, self.W2)
    # compute the gradient with respect to W2; shape must be (HiddenSize, HiddenSize)
    dW2 = np.dot(h_prev.T, dtanh)

    return dW1, dW2, dW3, dbh, dbo, dh_prev, per_ts_loss

def rnn_layer_backward(self, Xt, labels, H, O):
    """
        Runs a full backward pass on the given data. and returns the gradients.
        Inputs: 
        - Xt: The input data of shape (Batch_size, timesteps, input_dim_size)
        - labels: The labels for the input data
        - H: The hidden states for the current layer produced in the forward pass
          of shape (Batch_size, timesteps, HiddenStateSize)
        - O: The output for the current layer of shape (Batch_size, timesteps, outputsize)

        Returns :
        - dW1: The gradient for W1
        - dW2: The gradient for W2
        - dW3: The gradient for W3
        - dbh: The gradient for bh
        - dbo: The gradient for bo
        - dh: The gradient for the hidden state at timestep t
        - loss: The current loss 

    """

    dW1 = np.zeros_like(self.W1)
    dW2 = np.zeros_like(self.W2)
    dW3 = np.zeros_like(self.W3)
    dbh = np.zeros_like(self.bh)
    dbo = np.zeros_like(self.bo)
    dh_next = np.zeros_like(H[:, 0, :])
    hprev = None

    _, T_x, _ = Xt.shape
    loss = 0
    for t in reversed(range(T_x)):

        # this if-else block could be dropped and hprev replaced with H[:, t-1, :],
        # but I keep it in case it makes a difference; so far I have not seen any.
        if t > 0:
            hprev = H[:, t - 1, :]
        else:
            hprev = np.zeros_like(H[:, 0, :])

        dw_1, dw_2, dw_3, db_h, db_o, dh_prev, e = self.rnn_cell_backward(Xt[:, t, :],
                                                                          H[:, t, :],
                                                                          hprev,
                                                                          O[:, t, :],
                                                                          labels[:, t, :],
                                                                          dh_next)
        dh_next = dh_prev
        dW1 += dw_1
        dW2 += dw_2
        dW3 += dw_3
        dbh += db_h
        dbo += db_o

        # Update the loss: add this timestep's cross-entropy term by subtracting np.log(e).
        loss -= np.log(e)

    return dW1, dW2, dW3, dbh, dbo, dh_next, loss

I have commented everything and provided a minimal example to demonstrate this here:

My code (doesn't pass gradient check)

And here is the implementation that I used as my guide, from karpathy/Coursera, which passes all the gradient checks: original code

At this point I have no idea why this is not working. I'm a beginner in Python, so that could be why I can't find the issue.

Succinylsulfathiazole answered 30/4, 2019 at 9:33 Comment(0)

Two months later, I think I found the culprit! I should have changed the following line:

# compute the gradient with respect to W2
dh_prev = np.dot(dtanh, self.W2)

to

# compute the gradient with respect to W2
# note the transpose here!
dh_prev = np.dot(dtanh, self.W2.T) 
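
To convince myself this is the right derivative, and not just another reshape that happens to work, here is a tiny self-contained check. The sizes, seed, and variable names are made up for the illustration, and the forward convention matches the row-major layout my cell assumes, i.e. h = tanh(x.W1 + h_prev.W2 + bh):

import numpy as np

np.random.seed(0)
B, D, H = 4, 5, 3                          # batch size, input dim, hidden dim (made up)
x      = np.random.randn(B, D)
h_prev = np.random.randn(B, H)
W1     = np.random.randn(D, H)
W2     = np.random.randn(H, H)
bh     = np.random.randn(H)

def forward(h_prev):
    # same row-major convention as my cell: h = tanh(x.W1 + h_prev.W2 + bh)
    return np.tanh(x @ W1 + h_prev @ W2 + bh)

def loss(h_prev):
    # any scalar works; sum() keeps the example short
    return forward(h_prev).sum()

# analytic gradient w.r.t. h_prev
h       = forward(h_prev)
dtanh   = (1 - h * h) * np.ones_like(h)    # dL/dh is all ones because of .sum()
dh_prev = dtanh @ W2.T                     # the corrected line

# numerical gradient via centered differences
eps = 1e-6
num = np.zeros_like(h_prev)
for i in range(B):
    for j in range(H):
        hp = h_prev.copy(); hp[i, j] += eps
        hm = h_prev.copy(); hm[i, j] -= eps
        num[i, j] = (loss(hp) - loss(hm)) / (2 * eps)

print(np.max(np.abs(dh_prev - num)))       # essentially zero with W2.T; visibly large with plain W2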

When I was initially writing the backward pass, I only paid attention to the dimensions, and that is what led me into this mistake. It is an example of the feature mix-ups that can happen with mindless/blind reshaping or transposing (or failing to transpose!).
To see what went wrong here, let me give an example.
Suppose we have a matrix of people's features and we dedicate one row to each person, so our matrix looks like this:

          Features |  Age  | height (cm) | weight (kg) |
matrix =           |   20  |     185     |      75     |
                   |   85  |     155     |      95     |
                   |   40  |     205     |     120     |

Now if we make this into a numpy array we will have the following :

m = np.array([[20, 185, 75],
             [85, 155, 95],
             [40, 205, 120]])

A simple 3x3 array, right?
Now, the way we interpret this matrix is very important: each row and each column has a specific meaning. Each person is described by a row, and each column corresponds to a single feature.
So you see, there is a "structure" in the matrix we use to represent our data.
In other words, each data item is a row and each column is a feature, and that semantics has to be respected when multiplying with another matrix: the two matrices must agree on which axis carries the data items and which carries the features, otherwise we combine unrelated quantities.
Let's look at an example to make this clearer. Suppose we have two matrices:

 m1 = np.array([[20, 185, 75],
                [85, 155, 95],
                [40, 205, 120]])

 m2 = np.array([[0.9, 0.8, 0.85],
                [0.1, 0.5, 0.4],
                [0.6, 0.9, 0.8]])

These two matrices both arrange their data in rows, so multiplying them combines the quantities that belong together and gives the intended result. Reordering the data, for example by transposing one of them when the math does not call for it (or forgetting a transpose when it does), destroys that correspondence and we end up multiplying unrelated numbers.
In my case I needed to transpose the second matrix to get its orientation right for the operation at hand, and that fixed the gradient check.
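
Part of why this bug survived my shape checks: the hidden-to-hidden matrix W2 is square, so the wrong and the right expression have exactly the same shape and only differ in their values. A tiny illustration with made-up numbers:

import numpy as np

np.random.seed(1)
H = 3
dtanh = np.random.randn(4, H)     # (batch, hidden), made-up values
W2    = np.random.randn(H, H)     # square hidden-to-hidden weight matrix

wrong = dtanh @ W2                # what I originally wrote
right = dtanh @ W2.T              # the corrected version

print(wrong.shape, right.shape)   # (4, 3) (4, 3)  -> shapes are identical
print(np.allclose(wrong, right))  # False          -> values are not

That is exactly why "only paying attention to the dimensions" was not enough here; only the gradient check noticed the difference.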

Succinylsulfathiazole answered 19/6, 2019 at 13:27 Comment(3)
4 years, 8 months later: I tried my own implementation, which closely follows yours. My gradient check error is at 2.6e-07 and I can't figure out where I went wrong. Should I just move on? I'm paranoid that I missed something. – Brause
That's a good enough approximation, and it's expected because of floating-point funkiness, for lack of a better word. If you implement everything yourself, including e.g. the softmax(), you get slightly less accurate results than if you use a production-ready library implementation (for example scipy.special.softmax) that takes care of the normalization for numerical stability. Try these two versions of softmax and compare the results: softmax1 = lambda x: np.exp(x) / np.exp(x).sum(axis=-1) versus def softmax2(x): ex = np.exp(x - np.max(x)); return ex / ex.sum(axis=-1). – Succinylsulfathiazole
Thanks Hossein. I actually used TensorFlow functions for sigmoid, relu, and CCE from the beginning. I was able to fix this issue and got the gradient check to the 1e-8 level for binary classification by fixing an internal bug. I was also able to recreate Andrej Karpathy's char-rnn code and achieved 1e-8 or less with one row of data (as Karpathy does), but interestingly, the same code that passes with one row doesn't pass with 10 rows. So eventually I realized that numerical stability is not something we can beat, and we should move on at some point. – Brause
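
To make the numerical-stability point from the comments concrete, here is a small, self-contained comparison (the function names and inputs are made up for illustration): the naive softmax overflows for large logits, while the max-shifted version stays finite.

import numpy as np

def softmax_naive(x):
    ex = np.exp(x)                                       # can overflow for large inputs
    return ex / ex.sum(axis=-1, keepdims=True)

def softmax_stable(x):
    ex = np.exp(x - np.max(x, axis=-1, keepdims=True))   # shift by the max for stability
    return ex / ex.sum(axis=-1, keepdims=True)

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(x))    # [nan nan nan] -- exp overflows to inf
print(softmax_stable(x))   # [0.09003057 0.24472847 0.66524096]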
