Finding the gradient of a Caffe conv-filter with respect to the input

I need to find the gradient with regards to the input layer for a single convolutional filter in a convolutional neural network (CNN) as a way to visualize the filters.
Given a trained network in the Python interface of Caffe such as the one in this example, how can I then find the gradient of a conv-filter with respect to the data in the input layer?

Edit:

Based on the answer by cesans, I added the code below. The dimensions of my input layer are [8, 8, 7, 96]. My first conv-layer, conv1, has 11 filters with a size of 1x5, resulting in the dimensions [8, 11, 7, 92].

net = solver.net
diffs = net.backward(diffs=['data', 'conv1'])
print diffs.keys() # >> ['conv1', 'data']
print diffs['data'].shape # >> (8, 8, 7, 96)
print diffs['conv1'].shape # >> (8, 11, 7, 92)

As you can see from the output, the arrays returned by net.backward() have the same dimensions as my layers in Caffe. After some testing I've found that this output is the gradients of the loss with respect to the data layer and the conv1 layer, respectively.

However, my question was how to find the gradient of a single conv-filter with respect to the data in the input layer, which is something else. How can I achieve this?

Cantaloupe answered 9/7, 2015 at 17:48 Comment(0)

A Caffe net juggles two "streams" of numbers.
The first is the data "stream": the images and labels pushed through the net. As these inputs progress through the net they are converted into higher-level representations and eventually into class-probability vectors (in classification tasks).
The second "stream" holds the parameters of the different layers: the weights of the convolutions, the biases, etc. These numbers/weights are changed and learned during the training phase of the net.

Despite the fundamentally different roles these two "streams" play, Caffe nonetheless uses the same data structure, the blob, to store and manage them.
However, for each layer there are two distinct blob vectors, one for each stream.

Here's an example that I hope will clarify things:

import caffe
solver = caffe.SGDSolver( PATH_TO_SOLVER_PROTOTXT )
net = solver.net

If you now look at

net.blobs

You will see a dictionary storing a "caffe blob" object for each layer in the net. Each blob has storage for both the data and the gradient:

net.blobs['data'].data.shape    # >> (32, 3, 224, 224)
net.blobs['data'].diff.shape    # >> (32, 3, 224, 224)

And for a convolutional layer:

net.blobs['conv1/7x7_s2'].data.shape    # >> (32, 64, 112, 112)
net.blobs['conv1/7x7_s2'].diff.shape    # >> (32, 64, 112, 112)

net.blobs holds the first, data, stream; its shapes match those of the inputs, from the input images all the way up to the resulting class-probability vector.
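
To get a quick overview of this stream you can simply walk over net.blobs; a minimal sketch (blob names will of course differ from model to model):

for name, blob in net.blobs.items():
    print name, blob.data.shape, blob.diff.shape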

On the other hand, you can see another member of net

net.layers

This is a Caffe vector storing the parameters of the different layers.
Looking at the first layer (the 'data' layer):

len(net.layers[0].blobs)    # >> 0

There are no parameters to store for an input layer.
On the other hand, for the first convolutional layer

len(net.layers[1].blobs)    # >> 2

The net stores one blob for the filter weights and another for the constant bias. Here they are:

net.layers[1].blobs[0].data.shape  # >> (64, 3, 7, 7)
net.layers[1].blobs[1].data.shape  # >> (64,)

As you can see, this layer performs 7x7 convolutions on a 3-channel input image and has 64 such filters.
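
If you want to inspect a single one of these filters (for instance to plot its weights), you can slice it out of the weight blob; a minimal sketch, where filter_idx is an arbitrary choice:

w = net.layers[1].blobs[0].data   # all conv1 weights, shape (64, 3, 7, 7)
filter_idx = 0                    # pick one filter
print w[filter_idx].shape         # >> (3, 7, 7): one 7x7 kernel per input channel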

Now, how do you get the gradients? Well, as you noted,

diffs = net.backward(diffs=['data','conv1/7x7_s2'])

returns the gradients of the data stream. We can verify this:

import numpy as np

np.all( diffs['data'] == net.blobs['data'].diff )  # >> True
np.all( diffs['conv1/7x7_s2'] == net.blobs['conv1/7x7_s2'].diff )  # >> True

(TL;DR) You want the gradients of the parameters; these are stored in net.layers alongside the parameters themselves:

net.layers[1].blobs[0].diff.shape # >> (64, 3, 7, 7)
net.layers[1].blobs[1].diff.shape # >> (64,)

To map between layer names and their indices in the net.layers vector, you can use net._layer_names.
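
For example, a small sketch that builds such a mapping (the layer name follows the GoogLeNet-style naming used above):

layer_idx = {name: i for i, name in enumerate(net._layer_names)}
idx = layer_idx['conv1/7x7_s2']             # e.g. 1
print net.layers[idx].blobs[0].diff.shape   # weight gradients, looked up by name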


Update regarding the use of gradients to visualize filter responses:
A gradient is normally defined for a scalar function. The loss is a scalar, and therefore you can speak of the gradient of the scalar loss with respect to each pixel/filter weight. This gradient is a single number per pixel/filter weight.
If you want to find the input that results in maximal activation of a specific internal hidden node, you need an "auxiliary" net whose loss is exactly a measure of the activation of the specific hidden node you want to visualize. Once you have this auxiliary net, you can start from an arbitrary input and change this input based on the gradients of the auxiliary loss with respect to the input layer:

update = prev_in + lr * net.blobs['data'].diff
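
In practice you do not even need to build a literal auxiliary net: a common shortcut is to seed the diff of the blob you care about and backpropagate from that layer down to the input. Below is a minimal sketch of this idea, not the exact code of this answer; it assumes an input blob named 'data', a conv blob named 'conv1', and arbitrary values for filter_idx, lr and the number of iterations. You may also need force_backward: true in the net prototxt so that gradients are propagated all the way to the input blob.

filter_idx = 0   # which conv1 filter to maximize (arbitrary)
lr = 1.0         # step size for the input update (arbitrary)

for _ in range(100):
    net.forward(end='conv1')                          # forward pass only up to conv1
    net.blobs['conv1'].diff[...] = 0
    net.blobs['conv1'].diff[:, filter_idx, :, :] = 1  # seed: d(objective)/d(activation) = 1 for that filter
    net.backward(start='conv1')                       # backpropagate from conv1 down to the input
    net.blobs['data'].data[...] += lr * net.blobs['data'].diff  # gradient ascent on the input itself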
Deth answered 6/8, 2015 at 5:2 Comment(12)
Thanks for the detailed answer, which gave me more insight into how blobs work. However, don't you still get the gradients with respect to the loss? Filter visualization is done by optimizing the input in order to maximize the filter activation. For that I need the gradients with respect to the input, not with respect to the loss. Based on that, it seems that your answer doesn't really address the question, but does something similar to the answer given by cesans. – Cantaloupe
So gradients of the data stream are essentially accumulated gradients in the neuron from all the weights leading to it? – Ragland
@Ragland the gradients are computed according to the chain rule. – Deth
Yes, I understand that; each dE/dW is a partial derivative w.r.t. the weight. But net.backward(diffs=['data','conv1/7x7_s2']) returns values for all neurons in the layer rather than for all weights in the kernel, which I find a bit confusing. – Ragland
In the same way that you can derive the loss w.r.t. the weights, you can derive it w.r.t. the inputs. The blobs (along the "data" stream) hold the derivatives w.r.t. the data, while net.layers[k].blobs[i].diff holds the derivatives of the loss w.r.t. the parameters (of the k-th layer and the i-th parameter). @Ragland – Deth
'Derivatives w.r.t. the data', i.e. if the output of the neuron is x, would it be dE/dx? – Ragland
@Ragland I suppose so. – Deth
Yeah, because for 'loss' it returns 1, i.e. dE/dE. – Ragland
@Shai: OK, so if the weight diffs seem OK (large in the early training stages, small later on), but the neuron diffs are always(!) zero - is that a bug? – Ragland
@Ragland it seems like you have a question. Why don't you post it as such, with all the relevant details? This discussion is not really appropriate for comments. – Deth
No problem, I'll put it together once I have all the details. – Ragland
@Shai: I wrote a follow-up, #45273350, maybe you could have a look. – Ragland

You can get the gradients with respect to any layer when you run the backward() pass; just specify the list of layers when calling the function. To show the gradients with respect to the data layer:

import matplotlib.pyplot as plt

net.forward()
diffs = net.backward(diffs=['data', 'conv1'])
data_point = 16  # index of one example in the batch
plt.imshow(diffs['data'][data_point].squeeze())

In some cases you may want to force all layers to carry out the backward pass; have a look at the force_backward parameter of the network definition (a sketch of setting it follows the link below).

https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto
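
force_backward lives in the NetParameter message of that proto file. If you prefer not to edit the .prototxt by hand, here is a hedged sketch of setting it from Python via the protobuf API (the file names are hypothetical):

from caffe.proto import caffe_pb2
from google.protobuf import text_format

net_param = caffe_pb2.NetParameter()
with open('train_val.prototxt') as f:                 # hypothetical input file
    text_format.Merge(f.read(), net_param)

net_param.force_backward = True                       # ask Caffe to compute diffs for every blob

with open('train_val_force_bw.prototxt', 'w') as f:   # hypothetical output file
    f.write(text_format.MessageToString(net_param))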

Squawk answered 10/7, 2015 at 20:42 Comment(8)
Perfect, thanks! How would I get the exact same gradients used by SGD for tuning e.g. the parameters of the filters in conv1? Would diffs = net.backward(diffs=['loss', 'conv1']) give me that exactly, or does Caffe do some sort of operation on the gradients before making a step down the error surface? – Cantaloupe
How the weight update is computed depends on the solver. For SGD, at that point it includes the previous update if momentum is not zero, as well as the learning rate and weight decay. However, according to the information here: caffe.berkeleyvision.org/tutorial/solver.html (Updating Parameters) and the code here: github.com/BVLC/caffe/blob/master/src/caffe/solver.cpp, I guess that the values stored in diff are the final weight updates. – Squawk
I am currently training my CNN using a loop with solver.step(1). I would like to find the gradients during each iteration, and I guess that I can simply add diffs = net.backward(diffs=['loss', 'conv1']) to that loop, as the solver step automatically does a forward pass. Do you see any reason it might interfere with the training? – Cantaloupe
Running net.backward() just computes the gradients and stores them, but doesn't update the parameters, so that shouldn't be a problem. Although you will be running the backward pass twice. – Squawk
Yes, so training will be quite a bit slower, I guess. Do you know how I could extract the gradients while training with only one backward pass? I couldn't find any options that allowed this. – Cantaloupe
That's not possible from Python (as far as I know). However, you could write your own solver in Python using the gradients computed with backward() to update the parameters. If you do that, check whether weight decay, momentum, etc. are included in the diffs or not. – Squawk
Thanks, I might do that. I've updated my original post based on your answer. While it does help me find the gradients with respect to the loss, I can't see how it answers my original question regarding a single conv-filter. – Cantaloupe
Where does data_point = 16 come from? My diffs['data'].shape is (1, 3, 227, 227). – Denysedenzil
