Can Caffe classify pixels of an image directly?

I would like to classify pixels of an image to "is street" or "is not street". I have some training data from the KITTI dataset and I have seen that Caffe has an IMAGE_DATA layer type. The labels are there in form of images of the same size as the input image.

Besides Caffe, my first idea for solving this problem was to feed in image patches around each pixel to be classified (e.g. 20 pixels to the top / left / right / bottom, resulting in 41×41 = 1681 features per pixel).
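For reference, extracting such a patch could look like this in numpy (a hypothetical helper, not Caffe code; zero padding at the borders is one possible choice):

```python
import numpy as np

def extract_patch(img, row, col, radius=20):
    """Return the (2*radius+1) x (2*radius+1) patch centered on (row, col).

    The image is zero-padded so border pixels also get full patches.
    radius=20 gives the 41x41 patches described above.
    """
    padded = np.pad(img, radius, mode="constant")  # zero-pad all borders
    # (row, col) in the original image is (row+radius, col+radius) in padded
    return padded[row:row + 2 * radius + 1, col:col + 2 * radius + 1]

patch = extract_patch(np.zeros((100, 100)), 0, 0)
# patch.shape == (41, 41)
```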
However, I would prefer to tell Caffe how to use the labels directly, without having to create those image patches manually (the IMAGE_DATA layer type seems to suggest that this is possible).

Can Caffe classify pixels of an image directly? What would such a prototxt network definition look like? How do I give Caffe the information about the labels?

I guess the input layer would be something like

layers {
  name: "data"
  type: IMAGE_DATA
  top: "data"
  top: "label"
  image_data_param {
    source: "path/to/file_list.txt"
    mean_file: "path/to/imagenet_mean.binaryproto"
    batch_size: 4
    crop_size: 41
    mirror: false
    new_height: 256
    new_width: 256
  }
}

However, I am not sure what crop_size exactly means. Is the crop really centered? How does Caffe deal with the corner pixels? What are new_height and new_width good for?

Burkholder answered 12/5, 2015 at 18:36 Comment(2)
Your question is very big in the sense that it touches many subjects. Can you break it into smaller questions, one topic per question? You can (and should?) link the questions to give context.Emmyemmye
See also: Question on Google GroupsBurkholder

It seems you can try fully convolutional networks for semantic segmentation.

The paper is listed on the Caffe publications wiki: https://github.com/BVLC/caffe/wiki/Publications

Also here is the model: https://github.com/BVLC/caffe/wiki/Model-Zoo#fully-convolutional-semantic-segmentation-models-fcn-xs

This presentation can also be helpful: http://tutorial.caffe.berkeleyvision.org/caffe-cvpr15-pixels.pdf

Dumb answered 8/9, 2015 at 8:37 Comment(4)
That's what we actually did. However, it is not so straightforward to get this working.Burkholder
And one should also mention that you have to use a fork of CaffeBurkholder
@moose Please, post link to fork.Dumb
github.com/longjon/caffe/tree/future - it is mentioned in github.com/BVLC/caffe/wiki/…Burkholder

Can Caffe classify pixels? In theory, I think the answer is yes. I haven't tried it myself, but I don't think there is anything stopping you from doing so.

Inputs:
You need two IMAGE_DATA layers: one that loads the RGB image and another that loads the corresponding label-mask image. Note that if you use the convert_imageset utility, you cannot shuffle each set independently, or you won't be able to match an image to its label-mask.

An IMAGE_DATA layer has two "top"s: one for "data" and one for "label". I suggest you set the "label" of both input layers to the index of the image/label-mask pair, and add a utility layer that verifies that the two indices always match; this will prevent you from training on the wrong label-masks ;)

Example:

layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "data-idx"
  # parameters...
}
layer {
  name: "label-mask"
  type: "ImageData"
  top: "label-mask"
  top: "label-idx"
  # parameters...
}
layer {
  name: "assert-idx"
  type: "EuclideanLoss"
  bottom: "data-idx"
  bottom: "label-idx"
  top: "this-must-always-be-zero"
}
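One way to produce such matching source lists, sketched in Python (the helper and the file names are made up for illustration — line i of each list carries the same integer index i as its Caffe "label", so a mismatch shows up as a nonzero value in the assert-idx layer above):

```python
def paired_lists(image_paths, mask_paths):
    """Build the two file_list.txt contents so that line i of the image
    list and line i of the mask list belong together, using the line
    index i as the dummy label for both ImageData layers."""
    image_lines = [f"{p} {i}" for i, p in enumerate(image_paths)]  # -> "data-idx"
    mask_lines = [f"{p} {i}" for i, p in enumerate(mask_paths)]    # -> "label-idx"
    return image_lines, mask_lines

imgs, masks = paired_lists(["img/0001.png", "img/0002.png"],
                           ["mask/0001.png", "mask/0002.png"])
# imgs == ["img/0001.png 0", "img/0002.png 1"]
```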

Loss layer:
Now, you can do whatever you like with the input data, but eventually, to get pixel-wise labeling, you need a pixel-wise loss. Therefore, your last layer (before the loss) must produce a prediction with the same width and height as "label-mask". Not all loss layers know how to handle multiple labels, but "EuclideanLoss" (for example) can, so you should have a loss layer along the lines of:

layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "prediction" # same size as image
  bottom: "label-mask"
  top: "loss"
}
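For reference, Caffe's "EuclideanLoss" computes 1/(2N) · Σ (prediction − label)², where N is the batch size; a small numpy sketch of the same computation (the helper name is made up):

```python
import numpy as np

def euclidean_loss(prediction, label):
    """Euclidean loss as Caffe's EuclideanLoss layer computes it:
    1/(2N) * sum((prediction - label)^2), N = batch size (first axis)."""
    n = prediction.shape[0]
    return np.sum((prediction - label) ** 2) / (2.0 * n)

pred = np.ones((4, 1, 8, 8))    # batch of 4 one-channel 8x8 predictions
mask = np.zeros((4, 1, 8, 8))   # matching label-masks
# euclidean_loss(pred, mask) == 4*1*8*8 / (2*4) == 32.0
```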

I think "SoftmaxWithLoss" has a newer version that can be used in this scenario, but you'll have to check it out yourself. In that case "prediction" should be of shape 2-by-h-by-w (since you have 2 labels).
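What such a per-pixel softmax loss would compute, sketched in numpy (a hypothetical helper, not Caffe's implementation — the softmax runs over the channel axis, one channel per label, so 2 channels here):

```python
import numpy as np

def pixelwise_softmax_loss(scores, mask):
    """Per-pixel softmax cross-entropy.

    scores: (C, H, W) raw class scores, C = number of labels (2 here)
    mask:   (H, W) integer label per pixel (0 or 1)
    Returns the mean negative log-probability of the true label.
    """
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=0, keepdims=True)            # softmax over channels
    h, w = mask.shape
    # pick, for every pixel, the probability of its true label
    true_probs = probs[mask, np.arange(h)[:, None], np.arange(w)[None, :]]
    return -np.log(true_probs).mean()

# uniform scores over 2 labels -> every pixel costs -log(0.5) = log(2)
loss = pixelwise_softmax_loss(np.zeros((2, 4, 4)), np.zeros((4, 4), dtype=int))
```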

Additional notes:
Once you set the input size in the parameters of the "ImageData" layer, you fix the sizes of all blobs in the net. You must set the label-mask to the same size. Carefully consider how you are going to deal with images of different shapes and sizes.
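If the images have to be resized to a common size, the label-mask should be resized with nearest-neighbor interpolation, so no in-between label values are invented (bilinear interpolation is fine for the RGB image but not for the mask). A numpy sketch with a hypothetical helper:

```python
import numpy as np

def resize_nearest(mask, new_h, new_w):
    """Nearest-neighbor resize for a label mask: every output pixel copies
    one input pixel, so the output contains only original label values."""
    h, w = mask.shape
    rows = np.arange(new_h) * h // new_h   # source row for each output row
    cols = np.arange(new_w) * w // new_w   # source column for each output column
    return mask[rows[:, None], cols[None, :]]

small = np.arange(4).reshape(2, 2)         # labels 0..3
big = resize_nearest(small, 4, 4)          # upscaled 4x4 mask, same label set
```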

Emmyemmye answered 13/5, 2015 at 6:23 Comment(6)
I tried to address the main issues raised in your question; regarding the details of the parameters of the IMAGE_DATA layer - please ask a different, specific question about them.Emmyemmye
Could you explain more specifically why the shape has to be 2-by-h-by-w? As far as I have understood, the EuclideanLoss input has to have the same dimensions as the label, i.e. if the label is a grayscale image there would be only 1 channel and therefore the prediction would have to be of shape 1-by-h-by-w?Jephthah
What would be the num_output in the last convolutional layer or are you using a fully connected layer and reshape the output accordingly? @Emmyemmye @Martin ThomaJephthah
@thigi if you are using a "Convolution" layer, then num_output should equal the number of labels. If you are using "InnerProduct" param you would have to "Reshape" your prediction to get the proper shape for the loss layer.Emmyemmye
If I use EuclideanLoss the num_output has to be the same as the number of labels as well? Would you reshape after or before the loss layer? @EmmyemmyeJephthah
Have a look at that question if you do not understand what I mean: link @EmmyemmyeJephthah
