Interpretation of in_channels and out_channels in Conv2D in PyTorch Convolutional Neural Networks (CNN)

Asked 9/4, 2020 at 7:52 Answered 18/5, 2021 at 11:37

Solved python-3.x deep-learning pytorch conv-neural-network

Suppose I have a CNN whose first two layers are defined as follows:

inp_conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=(3, 3))

Please correct me if I am wrong, but I think this line of code can be understood as follows:

A single grayscale image comes in as input, and 6 different kernels of the same size (3,3) are used to make 6 different feature maps from that single image.

And suppose I have a second Conv2D layer right after the first one:

second_conv_connected_to_inp_conv = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=(3, 3))

What does this mean in terms of out_channels? Are there going to be 12 new feature maps for each of the 6 feature maps output by the first layer, OR a total of 12 feature maps computed from the 6 incoming feature maps?

Nucleolus asked 9/4, 2020 at 7:52 Comment(5)

The second option. There will be 12 output channels total and 12 kernels, each of which is shape [6,3,3]. What we call a "2d" conv in CNNs can actually be thought of more as a 3d conv but the kernel spans the entire (input) channels dimension and slides along the two spatial dimensions. – Blighter 9/4, 2020 at 7:58
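The shapes mentioned in this comment can be checked directly. This is only a minimal sketch: the 28x28 input size and the variable names are assumptions made for illustration, while the two layers mirror the ones in the question.

```python
import torch
import torch.nn as nn

# The two layers from the question (PyTorch spells the class nn.Conv2d).
inp_conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=(3, 3))
second_conv = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=(3, 3))

# Hypothetical input: a batch of one 28x28 grayscale image (size is an assumption).
x = torch.randn(1, 1, 28, 28)

h = inp_conv(x)
y = second_conv(h)

# Weight layout is (out_channels, in_channels, kernel_height, kernel_width).
print(inp_conv.weight.shape)     # torch.Size([6, 1, 3, 3])
print(second_conv.weight.shape)  # torch.Size([12, 6, 3, 3])
print(h.shape)                   # torch.Size([1, 6, 26, 26])
print(y.shape)                   # torch.Size([1, 12, 24, 24])
```

The second layer holds 12 kernels of shape [6, 3, 3] and produces 12 feature maps in total, not 12 per incoming map.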

@Blighter Can you please explain this to me in some intuitive terms? For example, if we have the output of a convolution layer as [Batch, 64, 64, 128] and we again use 64 kernels of size 3x3, how does this kernel transform the feature maps? I mean, with 1 channel I can understand that 6 different kernels produce 6 different feature maps, but with the shape above, how would you do that? – Nucleolus 11/1, 2021 at 13:18

In that case you have a 64 channel input feature map of spatial shape 64x128. For each output channel you have a kernel of shape 64x3x3 since each kernel spans all the input channels. Without padding, each channel of the output feature map would be 62 x 126. The upper left value in each output feature map can be computed by first centering the kernel at spatial location (1, 1) over the entire 64 channel input feature map, spanning all the channels, then elementwise multiplying the 64x3x3 kernel with the overlapping part of the input feature map and summing. – Blighter 11/1, 2021 at 13:59

The next output would be computed by doing the same except the kernel would be centered at location (1, 2), then (1, 3), etc... until all the entries in the output feature map channel are filled. Since you have a 64x3x3 kernel for each output channel you would repeat this entire computation for each kernel to fill in each channel of the output feature map. – Blighter 11/1, 2021 at 14:2
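The procedure in the two comments above can be reproduced for the upper-left output value. This is a rough sketch under the same assumptions as the comment (channels-first input with 64 channels and spatial shape 64x128, 64 output channels, no padding); the random data, the bias-free layer, and the variable names are illustrative choices, not part of the original discussion.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Channels-first input: batch of 1, 64 channels, spatial size 64x128.
x = torch.randn(1, 64, 64, 128)

# 64 output channels means 64 kernels, each of shape (64, 3, 3).
# No padding; bias is disabled here only to keep the manual check simple.
conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, bias=False)

with torch.no_grad():
    y = conv(x)

    # Upper-left value of output channel 0, computed by hand:
    # overlap kernel 0 with the 64x3x3 corner of the input, multiply, and sum.
    kernel0 = conv.weight[0]     # shape (64, 3, 3)
    patch = x[0, :, 0:3, 0:3]    # shape (64, 3, 3)
    manual = (kernel0 * patch).sum()

print(y.shape)  # torch.Size([1, 64, 62, 126])
print(torch.allclose(manual, y[0, 0, 0, 0], atol=1e-4))
```

Shifting the patch window one column to the right (`x[0, :, 0:3, 1:4]`) would give the next value in the same output channel, and switching to `conv.weight[1]` would fill the next output channel.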

@Blighter Oh sorry! I used channels-last. But anyway, things are the same. I just want to know: if a kernel of depth 64 spans all 128 feature maps (let us suppose channels-last), then shouldn't the new feature maps have 64 features? This is what I want to know in terms of 3D. For 1D and 2D, it's just a sliding window. But with 3D, I can't visualise how things are being done – Nucleolus 12/1, 2021 at 5:8

For each out_channel, you have a set of kernels for each in_channel.
Equivalently, each out_channel has an in_channel x height x width kernel:

import torch.nn as nn

for i in nn.Conv2d(in_channels=2, out_channels=3, kernel_size=(4, 5)).parameters():
    print(i)

Output:

Parameter containing:
tensor([[[[-0.0012,  0.0848, -0.1301, -0.1164, -0.0609],
      [ 0.0424, -0.0031,  0.1254, -0.0140,  0.0418],
      [-0.0478, -0.0311, -0.1511, -0.1047, -0.0652],
      [ 0.0059,  0.0625,  0.0949, -0.1072, -0.0689]],

     [[ 0.0574,  0.1313, -0.0325,  0.1183, -0.0255],
      [ 0.0167,  0.1432, -0.1467, -0.0995, -0.0400],
      [-0.0616,  0.1366, -0.1025, -0.0728, -0.1105],
      [-0.1481, -0.0923,  0.1359,  0.0706,  0.0766]]],


    [[[ 0.0083, -0.0811,  0.0268, -0.1476, -0.1142],
      [-0.0815,  0.0998,  0.0927, -0.0701, -0.0057],
      [ 0.1011,  0.1572,  0.0628,  0.0214,  0.1060],
      [-0.0931,  0.0295, -0.1226, -0.1096, -0.0817]],

     [[ 0.0715,  0.0636, -0.0937,  0.0478,  0.0868],
      [-0.0200,  0.0060,  0.0366,  0.0981,  0.1518],
      [-0.1218, -0.0579,  0.0621,  0.1310,  0.1376],
      [ 0.1395,  0.0315, -0.1375,  0.0145, -0.0989]]],


    [[[-0.1474,  0.1405,  0.1202, -0.1577,  0.0296],
      [-0.0266, -0.0260, -0.0724,  0.0608, -0.0937],
      [ 0.0580,  0.0800,  0.1132,  0.0591, -0.1565],
      [-0.1026,  0.0789,  0.0331, -0.1233, -0.0910]],

     [[ 0.1487,  0.1065, -0.0689, -0.0398, -0.1506],
      [-0.0028, -0.1191, -0.1220, -0.0087,  0.0237],
      [-0.0648,  0.0938, -0.0962,  0.1435,  0.1084],
      [-0.1333, -0.0394,  0.0071,  0.0231,  0.0375]]]], requires_grad=True)
Parameter containing:
tensor([ 0.0620,  0.0095, -0.0771], requires_grad=True)

A more detailed example going from 1 channel input, through 2 and 4 channel convolutions:

import torch
import torch.nn as nn

torch.manual_seed(0)

input0 = torch.randint(-1, 1, (1, 1, 8, 8)).type(torch.FloatTensor)
print('input0:', input0.size())
print(input0.data)

layer0 = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=2, stride=2, padding=0, bias=False)
print('\nlayer0:')
for i in layer0.parameters():
    print(i.size())
    i.data = torch.randint(-1, 1, i.size()).type(torch.FloatTensor)
    print(i.data)

output0 = layer0(input0)
print('\noutput0:', output0.size())
print(output0.data)

print('\nlayer1:')
layer1 = nn.Conv2d(in_channels=2, out_channels=4, kernel_size=2, stride=2, padding=0, bias=False)
for i in layer1.parameters():
    print(i.size())
    i.data = torch.randint(-1, 1, i.size()).type(torch.FloatTensor)
    print(i.data)
output1 = layer1(output0)
print('\noutput1:', output1.size())
print(output1.data)

output:

input0: torch.Size([1, 1, 8, 8])
tensor([[[[-1.,  0.,  0., -1.,  0.,  0.,  0.,  0.],
  [ 0.,  0.,  0., -1., -1.,  0., -1., -1.],
  [-1., -1., -1.,  0., -1.,  0.,  0., -1.],
  [-1.,  0.,  0.,  0.,  0., -1.,  0., -1.],
  [ 0., -1.,  0.,  0., -1.,  0.,  0., -1.],
  [-1.,  0., -1.,  0.,  0.,  0.,  0.,  0.],
  [-1.,  0., -1.,  0.,  0.,  0.,  0., -1.],
  [ 0., -1., -1.,  0.,  0., -1.,  0., -1.]]]])

layer0:
torch.Size([2, 1, 2, 2])
tensor([[[[-1., -1.],
          [-1.,  0.]]],

        [[[ 0., -1.],
          [ 0., -1.]]]])

output0: torch.Size([1, 2, 4, 4])
tensor([[[[1., 1., 1., 1.],
          [3., 1., 1., 1.],
          [2., 1., 1., 1.],
          [1., 2., 0., 1.]],

         [[0., 2., 0., 1.],
          [1., 0., 1., 2.],
          [1., 0., 0., 1.],
          [1., 0., 1., 2.]]]])

layer1:
torch.Size([4, 2, 2, 2])
tensor([[[[-1., -1.],
          [-1., -1.]],

         [[ 0., -1.],
          [ 0., -1.]]],


        [[[ 0.,  0.],
          [ 0.,  0.]],

         [[ 0., -1.],
          [ 0.,  0.]]],


        [[[ 0.,  0.],
          [-1.,  0.]],

         [[ 0., -1.],
          [-1.,  0.]]],


        [[[-1., -1.],
          [-1., -1.]],

         [[ 0.,  0.],
          [-1., -1.]]]])

output1: torch.Size([1, 4, 2, 2])
tensor([[[[-8., -7.],
          [-6., -6.]],

         [[-2., -1.],
          [ 0., -1.]],

         [[-6., -3.],
          [-2., -2.]],

         [[-7., -7.],
          [-7., -6.]]]])

Breaking down the linear algebra:

import numpy as np

np.sum(
    # kernel for layer1, in_channel 0, out_channel 0,
    # multiplied elementwise by output0, channel 0, top left corner
    (np.array([[-1., -1.],
               [-1., -1.]]) *
     np.array([[1., 1.],
               [3., 1.]])) +

    # kernel for layer1, in_channel 1, out_channel 0,
    # multiplied elementwise by output0, channel 1, top left corner
    (np.array([[ 0., -1.],
               [ 0., -1.]]) *
     np.array([[0., 2.],
               [1., 0.]]))
)

This will be equal to output1, channel 0, top left corner: -8.0
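The same breakdown can be repeated for every output channel and position. The sketch below copies output0 and the layer1 weights from the printout above and recomputes all of output1 with explicit loops in NumPy; the variable names (`weight1`, `patch`, and so on) are illustrative, and the loop is written for clarity rather than speed.

```python
import numpy as np

# output0 as printed above, shape (in_channels=2, 4, 4)
output0 = np.array([
    [[1., 1., 1., 1.], [3., 1., 1., 1.], [2., 1., 1., 1.], [1., 2., 0., 1.]],
    [[0., 2., 0., 1.], [1., 0., 1., 2.], [1., 0., 0., 1.], [1., 0., 1., 2.]],
])

# layer1 weights as printed above, shape (out_channels=4, in_channels=2, 2, 2)
weight1 = np.array([
    [[[-1., -1.], [-1., -1.]], [[0., -1.], [0., -1.]]],
    [[[0., 0.], [0., 0.]], [[0., -1.], [0., 0.]]],
    [[[0., 0.], [-1., 0.]], [[0., -1.], [-1., 0.]]],
    [[[-1., -1.], [-1., -1.]], [[0., 0.], [-1., -1.]]],
])

stride, k = 2, 2
out_size = (output0.shape[1] - k) // stride + 1  # (4 - 2) // 2 + 1 = 2
output1 = np.zeros((weight1.shape[0], out_size, out_size))

for o in range(weight1.shape[0]):        # one pass per output channel
    for i in range(out_size):            # output row
        for j in range(out_size):        # output column
            # The patch spans ALL input channels: shape (2, 2, 2)
            patch = output0[:, i * stride:i * stride + k, j * stride:j * stride + k]
            # Multiply elementwise with the (2, 2, 2) kernel and sum everything
            output1[o, i, j] = np.sum(weight1[o] * patch)

print(output1)
```

The printed array should match the output1 tensor shown earlier, with -8 in the top left corner of channel 0.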

Raffaello answered 18/5, 2021 at 11:37 Comment(3)

Yes, so how does (Batch, Width, Height, 64) become (Batch, New_Height, New_Width, 128)? I just want to understand it intuitively. For example, I could say hypothetically that each of the 128 new filters is applied to each of the 64 old feature maps, so for each new filter there will be 64 2-D features, and if we apply some_function (say, Add) to all of those features, we get one new feature. Stacking these features similarly, we'll have 128. How does this some_function work? – Nucleolus 18/5, 2021 at 12:20

You will have 128 sets of 64 kernels ('filters'). For each output channel you will sum the products of all of its kernels multiplied by their inputs. I'll add a more detailed example above. – Raffaello 18/5, 2021 at 17:43

Okay! So this is exactly what I was asking. It is the sum function. It sums the outputs over all 64 old channels to produce a single new channel. And if we do that 128 times, we'll have 128 new features. – Nucleolus 18/5, 2021 at 18:21
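The "sum" discussed in these comments can be checked numerically. This is a minimal sketch assuming channels-first tensors and made-up sizes (batch of 2, 16x16 spatial); the names `x`, `w`, `reference`, and `summed` are illustrative and do not come from the thread.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes: 64 input channels, 128 output channels, 3x3 kernels.
x = torch.randn(2, 64, 16, 16)   # (batch, in_channels, height, width)
w = torch.randn(128, 64, 3, 3)   # (out_channels, in_channels, kh, kw)

# Reference: the ordinary multi-channel convolution.
reference = F.conv2d(x, w)

# Same result built step by step: convolve each input channel with its own
# 2D kernel slice, then add the 64 partial results together.
summed = torch.zeros_like(reference)
for c in range(x.shape[1]):
    summed += F.conv2d(x[:, c:c + 1], w[:, c:c + 1])

print(reference.shape)  # torch.Size([2, 128, 14, 14])
print(torch.allclose(summed, reference, rtol=1e-4, atol=1e-3))
```

Each of the 128 output channels is therefore the sum of 64 single-channel results, one per input channel.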

To go from 6 channels to 12 channels in your second convolution layer, we take 12 filters of size 6x3x3. Each 6x3x3 filter gives a single channel as output when the dot product is performed at every spatial position. Since we are taking 12 of those 6x3x3 filters, we get exactly 12 channels as output. For more information, check this link.

https://cs231n.github.io/convolutional-networks/#conv

Edit: Think of it this way. We have 6 input channels, i.e. HxWx6, where H is the height and W is the width of the image. Since there are 6 channels, we take six 3x3 filters (assuming the kernel size is 3), one per channel. After the dot product we again get 6 channels. But now we add all 6 resulting channels to get a single channel. This operation is performed 12 times, each time with a different set of six filters, to get 12 channels.
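Written as a formula, the description above looks roughly like this. The symbols are assumptions introduced here for illustration: x_c is input channel c, w_{k,c} is the 3x3 filter connecting input channel c to output channel k, b_k is the optional per-channel bias, and y_k is output channel k. Stride 1 and no padding are assumed.

```latex
y_k[i, j] \;=\; b_k \;+\; \sum_{c=0}^{5} \; \sum_{u=0}^{2} \; \sum_{v=0}^{2}
    w_{k,c}[u, v] \; x_c[i + u,\; j + v],
\qquad k = 0, 1, \dots, 11
```

The two inner sums are the per-channel 3x3 dot product, the outer sum over c adds the 6 resulting maps into one, and letting k run over 12 filter sets gives the 12 output channels.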

Recountal answered 9/4, 2020 at 12:13 Comment(1)

Can you please explain this to me in some intuitive terms? For example, if we have the output of a convolution layer as [Batch, 64, 64, 128] and we again use 64 kernels of size 3x3, how does this kernel transform the feature maps? I mean, with 1 channel I can understand that 6 different kernels produce 6 different feature maps, but with the shape above, how would you do that? – Nucleolus 11/1, 2021 at 13:17