I was reading this from a paper: "Rather than using relatively large receptive fields in the first conv. layers we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3 × 3 conv. layers (without spatial pooling in between) has an effective receptive field of 5 × 5; three such layers have a 7 × 7 effective receptive field."
How do they end up with a receptive field of 7x7?
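One way to write down the arithmetic behind the quoted numbers, as a sketch that assumes stride 1 and no pooling between layers. The symbols r_n (side length of the effective receptive field after n layers) and k (filter side length) are notation introduced here, not taken from the paper:

```latex
\begin{aligned}
r_0 &= 1, \qquad r_n = r_{n-1} + (k - 1), \qquad k = 3 \\
r_1 &= 1 + 2 = 3 \\
r_2 &= 3 + 2 = 5 \\
r_3 &= 5 + 2 = 7
\end{aligned}
```

Here the receptive field is measured on the original input: it counts how many input pixels can influence a single output value, which is separate from the size of the output image.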
This is how I understand it: suppose we have a single 100x100 image.
1st layer: zero-pad the image and convolve it with the 3x3 filter, output another 100x100 filtered image.
2nd layer: zero-pad the previous filtered image and convolve it with another 3x3 filter, output another 100x100 filtered image.
3rd layer: zero-pad the previous filtered image and convolve it with another 3x3 filter, output the final 100x100 filtered image.
What am I missing here?
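The three steps above can be sketched in plain Python to check both the output size and the receptive field. This is a minimal illustration, not the paper's implementation. The all-ones filter, the single-pixel test image, and the helper names `conv3x3_same` and `footprint` are assumptions made for the example. It sets one input pixel to 1 and tracks how far its influence spreads after each layer. With stride 1, that spread equals the region of the input that a single output pixel depends on.

```python
def conv3x3_same(img):
    """Zero-padded 3x3 convolution with an all-ones filter, stride 1.

    Out-of-range neighbours are skipped, which is equivalent to
    padding the image with zeros.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        total += img[rr][cc]
            out[r][c] = total
    return out


def footprint(img):
    """Height and width of the bounding box of the nonzero pixels."""
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    return (rows[-1] - rows[0] + 1, cols[-1] - cols[0] + 1)


size = 100
image = [[0] * size for _ in range(size)]
image[50][50] = 1  # a single "on" pixel in the input

shapes = []      # output size after each layer
footprints = []  # how far the single input pixel has spread
layer = image
for _ in range(3):
    layer = conv3x3_same(layer)
    shapes.append((len(layer), len(layer[0])))
    footprints.append(footprint(layer))

print(shapes)
print(footprints)
```

The output image stays 100x100 after every layer, while the footprint of the single input pixel grows to 3x3, then 5x5, then 7x7. The image size and the receptive field are two different quantities.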