Scale layer in Caffe
I am looking through the Caffe prototxt for deep residual networks and have noticed the appearance of a "Scale" layer.

layer {
    bottom: "res2b_branch2b"
    top: "res2b_branch2b"
    name: "scale2b_branch2b"
    type: "Scale"
    scale_param {
        bias_term: true
    }
}

However, this layer is not available in the Caffe layer catalogue. Can someone explain the functionality of this layer and the meaning of its parameters, or point to up-to-date documentation for Caffe?

Steger answered 24/5, 2016 at 10:31 Comment(0)
You can find a detailed documentation on caffe here.

Specifically, for "Scale" layer the doc reads:

Computes a product of two input Blobs, with the shape of the latter Blob "broadcast" to match the shape of the former. Equivalent to tiling the latter Blob, then computing the elementwise product.
The second input may be omitted, in which case it's learned as a parameter of the layer.

It seems like, in your case (a single "bottom"), this layer learns a scale factor by which to multiply "res2b_branch2b". Moreover, scale_param { bias_term: true } means the layer learns not only a multiplicative scaling factor but also an additive constant term. So the forward pass computes:

res2b_branch2b <- res2b_branch2b * \alpha + \beta

During training the net tries to learn the values of \alpha and \beta.
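The broadcasting behaviour described in the doc can be sketched in a few lines of NumPy (this is an illustration of the arithmetic, not Caffe code; the shapes and values are made up):

```python
import numpy as np

# A sketch of what a single-bottom "Scale" layer with bias_term: true
# computes: a learned scale alpha and bias beta, broadcast over the
# remaining axes of the bottom blob. (In the ResNet prototxt the scale
# is applied per channel; the shapes below are illustrative.)
x = np.ones((2, 3, 4, 4))            # bottom blob, shape (N, C, H, W)
alpha = np.array([1.0, 2.0, 3.0])    # learned scale, one value per channel
beta = np.array([0.5, 0.5, 0.5])     # learned bias (bias_term: true)

# Reshape (C,) -> (1, C, 1, 1) so NumPy broadcasting tiles alpha and beta
# over the batch and spatial dimensions, then compute x * alpha + beta.
y = x * alpha.reshape(1, -1, 1, 1) + beta.reshape(1, -1, 1, 1)
# Channel 0 -> 1*1 + 0.5 = 1.5; channel 1 -> 2.5; channel 2 -> 3.5
```

With two bottoms, the second bottom simply takes the place of `alpha` here; with one bottom, Caffe stores `alpha` (and optionally `beta`) as learnable parameters of the layer.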

Lynxeyed answered 24/5, 2016 at 11:27 Comment(5)
Shai, then is it equivalent to putting one convolutional layer with filter size 1x1 after res2b_branch2b? If we do this, the output will also be y = W*x + b, and it will learn the W and b, right? So is this equivalent to the Scale layer when we don't provide the second bottom?Marinamarinade
@Marinamarinade it is only equivalent if x is 1D. Do not confuse inner product and scalar multiplication.Lynxeyed
Oh ok. Then it only learns two parameters, alpha and beta, instead of the whole W matrix in this case. Am I right?Marinamarinade
@Marinamarinade yes, only a scalar \alphaLynxeyed
If you are doing this in the Wolfram Language, Scale is equivalent to ConstantTimesLayer[] followed by ConstantPlusLayer[].Skydive
There's also some documentation on it in the caffe.proto file, you can search for 'ScaleParameter'.

Thanks a heap for your post :) The Scale layer was exactly what I was looking for. In case anyone wants an example of a layer that scales by a scalar (0.5) and then "adds" -2 (and those values shouldn't change):

layer {
  name: "scaleAndAdd"
  type: "Scale"
  bottom: "bot"
  top: "scaled"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  scale_param {
    filler {
      value: 0.5
    }
    bias_term: true
    bias_filler {
      value: -2
    }
  }
}

(The decay_mult's are probably unnecessary here, though; see the comments.) Other than that:

  • lr_mult: 0 - switches off learning for that param. The first "param {" refers to the weights (the scale), the second to the bias (lr_mult is not ScaleLayer-specific)
  • filler: a "FillerParameter" [see caffe.proto] telling how to fill the omitted second blob. Default is a constant "value: ..."
  • bias_filler: parameter telling how to fill an optional bias blob
  • bias_term: whether there is a bias blob

All taken from caffe.proto. And: I only tested the layer above with both filler values = 1.2.
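The effect of the fixed layer above can be sanity-checked with a few lines of NumPy (a sketch; the blob name `bot` and the values are from the example, the rest is illustrative):

```python
import numpy as np

# Quick check of what the frozen "scaleAndAdd" layer above computes:
# out = 0.5 * bot + (-2), with both parameters fixed via lr_mult: 0.
bot = np.array([[4.0, 0.0, -2.0]])   # a toy input blob
scaled = 0.5 * bot + (-2.0)
# 4.0 -> 0.0, 0.0 -> -2.0, -2.0 -> -3.0
```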

Beeves answered 26/7, 2016 at 20:7 Comment(2)
@dasWesen: You have already provided the weights and bias term, so you set lr_mult: 0 so that you don't learn them, and decay_mult: 0 so that you don't penalize the weights either.Marinamarinade
@Dharma: Wait, I was in the middle of editing already, but: I think the decay_mult's are not necessary after all. At least such additional regularization terms won't change the direction of the (loss) gradient, since those terms are constant: the variables involved (the scale and bias) are not allowed to change. But it might run faster with decay_mult: 0.Beeves
