Receptive Fields on ConvNets (Receptive Field size confusion)

Asked 10/5, 2016 at 11:9 Answered 1/11, 2021 at 3:31

machine-learning computer-vision neural-network deep-learning conv-neural-network

I was reading this from a paper: "Rather than using relatively large receptive fields in the first conv. layers we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3 × 3 conv. layers (without spatial pooling in between) has an effective receptive field of 5 × 5; three such layers have a 7 × 7 effective receptive field."

How do they end up with a receptive field of 7x7?

This is how I understand it: suppose that we have one image that is 100x100.

1st layer: zero-pad the image and convolve it with the 3x3 filter, output another 100x100 filtered image.

2nd layer: zero-pad the previous filtered image and convolve it with another 3x3 filter, output another 100x100 filtered image.

3rd layer: zero-pad the previous filtered image and convolve it with another 3x3 filter, output the final 100x100 filtered image.

What am I missing here?

Doubletongued answered 10/5, 2016 at 11:9 Comment(0)

Here's one way to think of it. Consider the following small image, with each pixel numbered as such:

00 01 02 03 04 05 06
10 11 12 13 14 15 16
20 21 22 23 24 25 26
30 31 32 33 34 35 36
40 41 42 43 44 45 46
50 51 52 53 54 55 56
60 61 62 63 64 65 66

Now consider the pixel 33 at the center. With the first 3x3 convolution, the generated value at pixel 33 will incorporate the values of pixels 22, 23, 24, 32, 33, 34, 42, 43, and 44. But notice that each of those pixels will also incorporate their surrounding pixels' values as well.

With the next 3x3 convolution, pixel 33 will again incorporate the values of its surrounding pixels, but now, the value of those pixels incorporates their surrounding pixels from the original image. In effect, this means that the value of pixel 33 is governed by the values reaching out to a 5x5 "square of influence" you could say.

Each additional 3x3 convolution has the effect of stretching the effective receptive field by another pixel in each direction.
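A minimal sketch of that "square of influence" argument, assuming stride 1 and ignoring image borders: start from the centre pixel (row 3, column 3) and repeatedly expand the set of pixels it depends on. The function name `influencing_pixels` is made up for this illustration.

```python
def influencing_pixels(center, num_layers, kernel=3):
    """Set of input pixels that can affect `center` after `num_layers`
    stacked kernel x kernel convolutions with stride 1."""
    half = kernel // 2
    pixels = {center}
    for _ in range(num_layers):
        # Each pixel depends on its kernel x kernel neighbourhood one layer down.
        pixels = {(r + dr, c + dc)
                  for (r, c) in pixels
                  for dr in range(-half, half + 1)
                  for dc in range(-half, half + 1)}
    return pixels

for n in (1, 2, 3):
    region = influencing_pixels((3, 3), n)
    rows = sorted({r for r, _ in region})
    print(n, "layer(s):", len(rows), "x", len(rows), "- rows", rows[0], "to", rows[-1])
```

With three layers the region spans rows 0 to 6 and columns 0 to 6, which is the whole 7x7 grid drawn above.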

I hope that didn't just make it more confusing...

Talebearer answered 10/5, 2016 at 16:19 Comment(1)

This is a great answer, not sure why this isn't appearing at the top.... – Reichard 20/7, 2020 at 6:48

                  16          ------> Layer3
                13 14 15      ------->Layer2
              8 9 10 11 12    ------->Layer1
              1 2 3 4 5 6 7   ------->Input Layer

Let's consider 1D instead of 2D for clarity. Treat each number as one pixel and each row as the output of one convolution layer. Take receptive field (filter size) F = 3, padding P = 0 and stride S = 1 for each layer, and let W be the number of pixels at each layer. Then, by the formula:

             W_{j+1} = (W_j - F + 2P)/S + 1

In this case we have 7 pixels at the input layer, so the formula gives the number of pixels at each layer above it: 5, 3 and 1. Now look at the pixel named 16 at Layer3: it receives its inputs from 13, 14 and 15, since F=3. Similarly, 13, 14 and 15 get their inputs from (8 9 10), (9 10 11) and (10 11 12) respectively, since S=1 and F=3.

Similarly, 8 gets its inputs from (1 2 3), 9 from (2 3 4), ..., 12 from (5 6 7).

So, with respect to 16, it gets its input from all 7 pixels at the bottom.
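The same 1D walk-through can be checked with a few lines of Python. This is only a sketch under the settings used above (F=3, P=0, S=1); the helper names `output_width` and `input_span` are invented here, and positions are 0-based, so the diagram's pixels 1 to 7 become indices 0 to 6.

```python
def output_width(w, f=3, p=0, s=1):
    """Width of the next layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

def input_span(index, layers, f=3):
    """(first, last) 0-based input positions that feed output `index`
    after `layers` unpadded, stride-1 convolutions of width f."""
    first, last = index, index
    for _ in range(layers):
        # With P=0 and S=1, output i reads positions i .. i+f-1 of the layer below.
        last += f - 1
    return first, last

widths = [7]
for _ in range(3):
    widths.append(output_width(widths[-1]))

print(widths)            # [7, 5, 3, 1]
print(input_span(0, 3))  # (0, 6): the single top pixel sees all 7 inputs
```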

Now, the main advantages of using small receptive fields are twofold. First, there are fewer parameters than with large receptive fields. Second, a non-linearity is applied after each layer, so the combination of those bottom 7 pixels is more expressive than what a single large filter, which is one linear map, can produce. I would recommend checking out the link below, from the CS231n course, where all of this is beautifully explained.

Convolutional Neural Networks (CNNs/ConvNets)
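To put rough numbers on the parameter argument, here is a small sketch that counts weights for three stacked 3x3 layers versus one 7x7 layer, both covering a 7x7 receptive field. It assumes every layer has the same number of input and output channels and ignores biases; the channel count 64 is an arbitrary value chosen for illustration.

```python
def conv_params(kernel, channels, layers=1):
    """Weights in `layers` stacked kernel x kernel convolutions with
    `channels` input and output channels each (biases ignored)."""
    return layers * kernel * kernel * channels * channels

C = 64  # hypothetical channel count, for illustration only
stacked = conv_params(3, C, layers=3)  # three 3x3 layers: 27 * C^2
single = conv_params(7, C)             # one 7x7 layer:    49 * C^2

print(stacked, single)   # 110592 200704
print(stacked / single)  # roughly 0.55
```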

Derisible answered 11/5, 2016 at 11:2 Comment(1)

I am not sure if W_j+1 is the number of zeros, and it was not stated as that in the link you posted. Please enlighten me. Thanks. – Theologize 6/8, 2017 at 1:41

I think a good answer has been provided by @Aenimated1, but the link provided by @chirag puts the answer well. I paste the link here again for anyone else coming here:

http://cs231n.github.io/convolutional-networks/

And the specific extract that answers the question is:

Suppose that you stack three 3x3 CONV layers on top of each other (with non-linearities in between, of course). In this arrangement, each neuron on the first CONV layer has a 3x3 view of the input volume. A neuron on the second CONV layer has a 3x3 view of the first CONV layer, and hence by extension a 5x5 view of the input volume. Similarly, a neuron on the third CONV layer has a 3x3 view of the 2nd CONV layer, and hence a 7x7 view of the input volume.

To buttress this answer, I came across this post, which might be very useful. It answers whatever doubts one has about receptive fields:

https://medium.com/@nikasa1889/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-e0f514068807
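For layers whose stride is not 1, the usual receptive field recurrence also tracks the "jump", meaning the distance in input pixels between two adjacent outputs. The sketch below is a generic version of that recurrence written for this thread, not code taken from the linked post; the function name and the (kernel, stride) layer format are assumptions.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, listed from the input side."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the view by (k-1) jumps
        jump *= stride             # spacing of this layer's outputs, in input pixels
    return rf

print(receptive_field([(3, 1)] * 2))               # 5
print(receptive_field([(3, 1)] * 3))               # 7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))   # 8
```

With stride 1 everywhere the jump stays at 1, and n layers of size k give 1 + n(k-1), which is 5 and 7 for two and three 3x3 layers.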

Theologize answered 6/8, 2017 at 1:31 Comment(0)

Assume that we have a network architecture that consists only of convolution layers. For each convolution layer, we define a square kernel size and a dilation rate. Also assume that the stride is 1. Then you can compute the receptive field of the network with the following piece of Python code:

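K = [3, 3]  # kernel size of each layer
R = [1, 2]  # dilation rate of each layer

RF = 1  # a single input pixel sees only itself
for d, (k, r) in enumerate(zip(K, R), start=1):  # d is the layer depth
    # An r-dilated kernel has r-1 zeros between adjacent coefficients.
    support = k + (k - 1) * (r - 1)
    # With stride 1, each layer widens the field by support-1 pixels.
    RF = support + (RF - 1)
    print('depth=%d, K=%d, R=%d, kernel support=%d' % (d, k, r, support))
print('Receptive Field: %d' % RF)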

As an example, let's compute the receptive field (RF) of the well-known DnCNN (denoising convolutional neural network) [1]. Running the code above with the following inputs gives RF=35.

# In DnCNN-S, the network has 17 convolution layers.
K=[3]*17  # Kernel Size
R=[1]*17  # Dilation Rate
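Wrapping the same loop in a function makes both cases easy to check side by side. This is just a repackaging of the calculation above under the same stride-1 assumption; the function name `dilated_receptive_field` is made up for this sketch.

```python
def dilated_receptive_field(kernels, dilations):
    """Receptive field of stacked stride-1 convolutions with the given
    kernel sizes and dilation rates (one entry per layer)."""
    rf = 1
    for k, r in zip(kernels, dilations):
        support = k + (k - 1) * (r - 1)  # footprint of the dilated kernel
        rf += support - 1
    return rf

print(dilated_receptive_field([3, 3], [1, 2]))      # 7
print(dilated_receptive_field([3] * 17, [1] * 17))  # 35
```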

[1] Zhang, Kai, et al. "Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising." IEEE Transactions on Image Processing 26.7 (2017): 3142-3155.

Gastroenteritis answered 4/8, 2018 at 5:17 Comment(0)

Consider a 5x5 image (say I) and two scenarios: 3x3 and 5x5 receptive fields.

(3x3 case): First, we extract features in the first layer and get a 3x3 output image H. Then we do one more convolution and get a 1x1 output O.

(5x5 case): We do just one convolution instead of two and get a 1x1 output O.

So, effectively, the output pixel sees the same 5x5 region of the input in both cases; the smaller 3x3 filter just takes more steps to get there.

Note: You could argue that the network "loses" some information or "freedom" as the number of parameters declines (18 parameters in the first case, 25 in the second), so it is fair to ask in what sense 3x3 completely "covers" 5x5. It covers the same input region, not the same set of possible filters.
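A small pure-Python sketch of the two scenarios, assuming no padding, stride 1, and no non-linearity between the two 3x3 layers. Under those assumptions the two 3x3 filters collapse into one equivalent 5x5 filter, but that filter is built from only 18 numbers, so not every 5x5 filter can be reached this way. The image and kernel values are arbitrary, and the helper names are invented for this example.

```python
def correlate_valid(image, kernel):
    """Unpadded, stride-1 sliding-window product (the CNN 'convolution')."""
    k = len(kernel)
    n = len(image) - k + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(n)] for i in range(n)]

def compose(k1, k2):
    """Single kernel equivalent to applying k1, then k2, with nothing in between."""
    size = len(k1) + len(k2) - 1
    out = [[0] * size for _ in range(size)]
    for a in range(len(k2)):
        for b in range(len(k2)):
            for c in range(len(k1)):
                for d in range(len(k1)):
                    out[a + c][b + d] += k2[a][b] * k1[c][d]
    return out

image = [[r * 5 + c for c in range(5)] for r in range(5)]   # arbitrary 5x5 input I
k1 = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]                   # arbitrary first 3x3 filter
k2 = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]                     # arbitrary second 3x3 filter

two_step = correlate_valid(correlate_valid(image, k1), k2)  # 5x5 -> 3x3 (H) -> 1x1 (O)
one_step = correlate_valid(image, compose(k1, k2))          # 5x5 -> 1x1 (O)
print(two_step == one_step)  # True
```

In a real network a non-linearity sits between the two layers, so the stack is no longer equivalent to any single 5x5 filter; only the 5x5 receptive field is shared.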

Goddord answered 1/11, 2021 at 3:31 Comment(0)
