How do filters run across an RGB image in the first layer of a CNN?

Asked 21/6, 2020 at 3:8 Answered 23/6, 2020 at 23:29

Solved neural-network conv-neural-network convolution channel vgg-net

I was looking at this printout of layers. I realized, this shows input / output, but nothing about how the RGB channels are dealt with.

If you look at block1_conv1, it says "Conv2D". But if the input is 224 x 224 x 3, then that's not 2D.

But my bigger, broader question is how 3-channel inputs are treated over the course of training a model like this (I think it's VGG16). Are the RGB channels combined (summed, or concatenated) at some point? When and where? Does that require some unique filter for that purpose? Or does the model run across the different channel/color representations separately from end to end?

Avrilavrit answered 21/6, 2020 at 3:8 Comment(0)

The "2D" part of a 2D convolution refers neither to the dimension of the convolution input nor to the dimension of the filter itself, but to the space in which the filter is allowed to move (2 directions only). Another way of thinking about this is that each of the RGB channels has its own separate 2D-array filter, and the outputs are added at the end.

does the model run across the different channel/color representations separately from end to end?

Effectively it does this across each channel separately. For example, the first Conv2D layer takes in each of the three 224x224 channels separately and applies a different 2D-array filter to each one. But this isn't end-to-end across all model layers; it happens only within a layer, during the convolution step.

But, you might ask, there are 64 convolution filters for each channel, so why are there not 3*64 = 192 channels in the Conv2D output for the 3 channels? This prompts your question:

Are the RGB channels combined (summed, or concatenated) at some point?

The answer is: yes. After the convolution filter is applied to each channel separately, the values for the three channels are added together, along with a bias if you've specified one. See the diagram below (from Dive Into Deep Learning, under CC BY-SA 4.0):

The reason for this (that the separate channel outputs are added) is that there aren't actually 3 separate 2D-array filters, one per channel; technically there's just one 3D-array filter which only moves in two directions. You could think of it a bit like a hamburger: there's one 2D-array filter for one channel (the bun), another for the next channel (the lettuce), etc., but all the layers are stacked up and function as a whole, so the products from every layer are summed together in one step. There's no need to add a special weight or filter for combining the channels, since the weights are already present during the convolution step (an extra weight would just multiply two fitting parameters together, which may as well be one).
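To make the "apply per channel, then add" step concrete, here is a minimal NumPy sketch. The input and kernel values are an assumption: they are meant to mirror the two-channel Dive Into Deep Learning example, and they agree with the arithmetic in the comment below. The helper names `corr2d` and `corr2d_multi_in` are just illustrative, not any library's API.

```python
import numpy as np

def corr2d(x, k):
    """Valid 2D cross-correlation of one channel with one 2D kernel slice."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def corr2d_multi_in(x, k, bias=0.0):
    """Apply each kernel slice to its own channel, then sum over channels."""
    return sum(corr2d(xc, kc) for xc, kc in zip(x, k)) + bias

# Two input channels of size 3x3 (assumed values, channel-first layout).
x = np.array([[[0, 1, 2], [3, 4, 5], [6, 7, 8]],
              [[1, 2, 3], [4, 5, 6], [7, 8, 9]]], dtype=float)
# One "3D" filter: a 2x2 slice per input channel.
k = np.array([[[0, 1], [2, 3]],
              [[1, 2], [3, 4]]], dtype=float)

y = corr2d_multi_in(x, k)
print(y)  # a single 2x2 output channel; the top-left entry is 56
```

Two input channels go in, but only one output channel comes out, because the per-channel results are summed. A layer with 64 filters simply repeats this with 64 different stacks of slices.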

Sully answered 23/6, 2020 at 22:36 Comment(3)

(1*1+2*2+4*3+5*4)+(0*0+1*1+3*2+4*3) = 56 – Sully 24/6, 2020 at 4:34

If the same kernel was applied to R, G and B and then summed -- why in this example is a different kernel applied to R than what is applied to G? (For this case I'm assuming the two channels of "Input" in the diagram are called R and G) – Shimmer 12/11, 2021 at 17:44

@Shimmer it depends on whether you're conceiving of the kernel as a single 3D kernel or as two distinct, different 2D kernels. In the latter case - different kernels are applied to R and G, not the same. In the former case - it's a single 3D kernel but different 2D slices are applied to R and G respectively - this is just a different way of thinking about things, these two are the same fundamentally. – Sully 12/11, 2021 at 20:59

2D convolutions?

This is a good resource.

If you look at block1_conv1, it says "Conv2D". But if the input is 224 x 224 x 3, then that's not 2D.

You misunderstand the meaning of 2D convolution. The 2D convolution filter moves across your input along the height and width of each channel. It never gets moved in the 3rd dimension. Because it moves only in this plane, it is a 2D convolution. The actual filter is 3D, yes.

But we effectively have 3 images (red, green and blue channels). How are they used to generate just 1D output?

by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores. Stanford CS231n's website

The outputs of each layer of the filter, convolved with its respective input channel, are summed element-wise as matrices. A bias is added to all elements if you want one; you might not. So, if you have 3 input channels and 1 filter, you get X-Y-1. If you have 2 (or 1, or 3, or 4 or 1000) input channels and 15 filters, you just get X-Y-15. X and Y are the output height and width, which depend on other parameters of the convolution (such as kernel size, stride and padding).
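A rough NumPy sketch of that shape rule, under stated assumptions: stride 1, no padding, channels-last layout, and made-up 8x8 inputs. `conv2d_layer` is a hypothetical helper, not the Keras implementation. The last lines assume 3x3 kernels for the first VGG16 layer, which the printout in the question does not show.

```python
import numpy as np

def conv2d_layer(x, w, b):
    """x: (H, W, C_in), w: (kh, kw, C_in, C_out), b: (C_out,). Stride 1, no padding."""
    kh, kw, c_in, c_out = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w, c_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + kh, j:j + kw, :]  # shape (kh, kw, C_in)
            # Sum over height, width and input channels, for all filters at once.
            out[i, j, :] = np.tensordot(patch, w, axes=3) + b
    return out

rng = np.random.default_rng(0)

# 3 input channels, 15 filters -> X-Y-15
rgb = rng.normal(size=(8, 8, 3))
out_rgb = conv2d_layer(rgb, rng.normal(size=(3, 3, 3, 15)), np.zeros(15))

# 1 input channel, 15 filters -> still X-Y-15
gray = rng.normal(size=(8, 8, 1))
out_gray = conv2d_layer(gray, rng.normal(size=(3, 3, 1, 15)), np.zeros(15))

print(out_rgb.shape, out_gray.shape)

# Weight count for a layer like block1_conv1, assuming 3x3 kernels:
# each of the 64 filters has 3*3*3 weights plus one bias.
block1_conv1_params = 3 * 3 * 3 * 64 + 64
print(block1_conv1_params)
```

The number of input channels only changes the depth of each filter; the number of output channels is set by the number of filters.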

3D convolutions

3D convolutions are different, in that each filter is 3D and also moves along the third dimension, across the different channels, so each weight is used at more than one position along that axis.
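As a hedged illustration of the difference, the sketch below slides a 3D kernel along all three axes of a made-up 4x5x5 volume. `corr3d` is an illustrative helper, not a library function. When the kernel is as deep as the input, there is only one position along the depth axis, which is the Conv2D situation described above.

```python
import numpy as np

def corr3d(x, k):
    """Valid 3D cross-correlation: the kernel slides along depth, height and width."""
    kd, kh, kw = k.shape
    od = x.shape[0] - kd + 1
    oh = x.shape[1] - kh + 1
    ow = x.shape[2] - kw + 1
    out = np.zeros((od, oh, ow))
    for d in range(od):
        for i in range(oh):
            for j in range(ow):
                out[d, i, j] = np.sum(x[d:d + kd, i:i + kh, j:j + kw] * k)
    return out

x = np.arange(4 * 5 * 5, dtype=float).reshape(4, 5, 5)  # (depth, height, width)

shallow = corr3d(x, np.ones((2, 3, 3)))  # kernel shallower than the input
full = corr3d(x, np.ones((4, 3, 3)))     # kernel spans the full depth

print(shallow.shape)  # three positions along depth, so the output is a volume
print(full.shape)     # one position along depth, as in a 2D convolution
```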

What are 1x1 convolutions then?

It's a trick used to reduce the computational cost of a network by reducing the number of channels (reducing the dimensionality of the input) without doing any 'real' spatial convolution: each output pixel is just a weighted sum of the input channels at that same pixel. This is useful because fewer channels means fewer weights and fewer operations for any computation that uses this output as input, such as bigger (3x3 or 5x5) convolutions.
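A small sketch of this, assuming channels-last layout and hypothetical sizes (256 input channels reduced to 64). A 1x1 convolution is then the same matrix multiply applied at every pixel, and the weight counts show the saving for a following 3x3 convolution (biases ignored).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 256))  # (height, width, channels), made-up sizes
w = rng.normal(size=(256, 64))    # a 1x1 kernel is just a (C_in, C_out) matrix

# Per-pixel mixing of channels: no neighbouring pixels are involved.
y = x @ w
print(y.shape)

# Weights for a 3x3 convolution producing 64 channels:
direct = 3 * 3 * 256 * 64                       # applied straight to 256 channels
reduced = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 64    # 1x1 reduction first, then 3x3
print(direct, reduced)
```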

Weft answered 23/6, 2020 at 23:29 Comment(0)
