TensorFlow Object Detection API SSD model using 'keep_aspect_ratio_resizer'
I am trying to detect objects in images of different shapes (not square). I used the faster_rcnn_inception_v2 model, where I can use an image resizer that maintains the aspect ratio of the image, and the output is satisfactory.

image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 100
    max_dimension: 600
  }
}

Now, for faster performance, I want to train it using an SSD model such as ssd_inception_v2. The sample configuration uses a fixed-shape resizer, as below:

image_resizer {
  fixed_shape_resizer {
    height: 300
    width: 300
  }
}

The problem is that I get very poor detection results because of that fixed resize. I tried changing it to keep_aspect_ratio_resizer, as used earlier with faster_rcnn_inception_v2, and I get the following error:

InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [1,100,500,3] vs. shape[1] = [1,100,439,3]

How can I configure SSD models to resize images while maintaining the aspect ratio?

Eva answered 8/1, 2018 at 6:45 Comment(1)
Your problem might be that your images have different aspect ratios and you use a batch size larger than 1. When the images are resized with the aspect ratio preserved, they end up with different sizes and thus cannot be batched together. – Fredra
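Building on that comment: with a batch size of 1 there is nothing to stack, so the concat error above should not occur when using keep_aspect_ratio_resizer, at the cost of slower and noisier training. A minimal sketch of the relevant train_config field (the rest of the config stays unchanged):

train_config {
  batch_size: 1  # differently sized images cannot be stacked into one batch
  # ... other training settings unchanged
}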

SSD and Faster R-CNN work quite differently from one another, so even though F-RCNN has no such constraint, for SSD you need input images that always have the same size (strictly, you need the feature maps to always have the same size, but the simplest way to ensure that is to always use the same input size). This is because SSD ends with fully connected layers, for which you need to know the size of the feature maps; whereas F-RCNN has only convolutions (which work on any input size) up to the ROI-pooling layer, which doesn't need a fixed image size either.

So you need to use a fixed-shape resizer for SSD. In the best case, your data always has the same width/height ratio; in that case, just use a fixed_shape_resizer with that same ratio. Otherwise, you'll have to choose an image size (w, h) yourself, more or less arbitrarily (some kind of average over your data would do). From there you have several options:

  • Letting TF resize the input to (w, h) with the resizer, without any preprocessing. The problem is that the images will be deformed, which may or may not be an issue, depending on your data and the objects you're trying to detect.

  • Cropping all the images into sub-images with the same aspect ratio as (w, h). Problem: you'll lose part of each image, or you'll have to run several inferences per image.

  • Padding all images (with black pixels or random white noise) to get images with the same aspect ratio as (w, h). You'll then have to translate the output bounding-box coordinates back to the original image: the coordinates you get are relative to the padded image, so you rescale them by padded_size/original_size on each axis (see the sketch after this list). The problem is that some objects will be downsized (relative to the full image size) more than others, which may or may not be a problem depending on your data and what you're trying to detect.
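To make the padding option concrete, here is a minimal NumPy sketch of the pad-then-translate idea. The function and variable names are illustrative, not part of the Object Detection API; it assumes padding is added on the bottom/right and that the detector returns normalized [ymin, xmin, ymax, xmax] boxes relative to the padded image (as the API does).

import numpy as np

def pad_to_aspect_ratio(image, target_w_over_h):
    """Pad an HxWx3 image with black pixels (bottom/right) to reach the target width/height ratio."""
    h, w = image.shape[:2]
    if w / h < target_w_over_h:          # too narrow: pad the width
        new_h, new_w = h, int(round(h * target_w_over_h))
    else:                                # too wide: pad the height
        new_h, new_w = int(round(w / target_w_over_h)), w
    padded = np.zeros((new_h, new_w, 3), dtype=image.dtype)
    padded[:h, :w] = image
    return padded, (h, w), (new_h, new_w)

def boxes_to_original(boxes, orig_hw, padded_hw):
    """Map normalized [ymin, xmin, ymax, xmax] boxes from the padded image back to the original image."""
    (h, w), (ph, pw) = orig_hw, padded_hw
    scale = np.array([ph / h, pw / w, ph / h, pw / w])
    return np.clip(boxes * scale, 0.0, 1.0)

# Example: pad a 1920x800 image to a 1:1 ratio before inference,
# then translate the detector's normalized boxes back to the original frame.
image = np.zeros((800, 1920, 3), dtype=np.uint8)
padded, orig_hw, padded_hw = pad_to_aspect_ratio(image, target_w_over_h=1.0)
detections = np.array([[0.1, 0.2, 0.3, 0.4]])   # placeholder for model output
print(boxes_to_original(detections, orig_hw, padded_hw))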

Dillman answered 8/1, 2018 at 13:36 Comment(4)
Not sure why a fully connected layer has to have a fixed size while a conv layer does not. Can't we add an extra layer to match the output? I thought it could be fixed by tweaking the config file; looks like that's not the case. – Eva
A fully connected layer is basically a matrix multiplication between the features and a weight matrix. This weight matrix has to be of shape (n_features_input, n_outputs), so n_features_input has to be known when you build the network. A convolution, on the other hand, multiplies the input features by the same (small) kernels at every location, so you don't need to know the input size, but the output size will depend on it. It just works differently; take a course if you want to know more about it. – Dillman
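To illustrate that comment, here is a small NumPy sketch (purely illustrative; conv2d_valid is a toy helper, not a library function) showing that the same convolution kernel works on inputs of any spatial size, while a fully connected weight matrix is tied to one input size:

import numpy as np

rng = np.random.default_rng(0)

# Convolution: the same small kernel slides over the input, so any spatial
# size works; only the output size changes with the input size.
def conv2d_valid(x, kernel):
    kh, kw = kernel.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

kernel = rng.standard_normal((3, 3))
print(conv2d_valid(rng.standard_normal((10, 10)), kernel).shape)  # (8, 8)
print(conv2d_valid(rng.standard_normal((10, 16)), kernel).shape)  # (8, 14)

# Fully connected layer: the weight matrix is (n_features_input, n_outputs),
# so n_features_input is baked in at build time.
weights = rng.standard_normal((8 * 8, 10))        # built for flattened 8x8 feature maps
print((rng.standard_normal((1, 8 * 8)) @ weights).shape)  # works: (1, 10)
# rng.standard_normal((1, 8 * 14)) @ weights      # shape mismatch -> ValueError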
The performance will drop anyway because SSD is not as good as F-RCNN; maybe it's not the resizing that causes the problem. – Dillman
I think that gdelab's comment makes sense. If the SSD is trained on input images at a 1:1 ratio, then during inference the input is also supposed to be reshaped to 1:1 before being fed into the model. So, as long as your input images are not "weirdly shaped" (for example, a normal 1280x1024 size) and the objects are big enough (minimum around 64x64), you should not have a problem. But if your input image is, for example, 1920x800, then resizing it to a 1:1 ratio could deform the objects too much. You still have the option to add additional anchor "ratio" parameters, for example 1:4 and 4:1. – Hautboy
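If you want to try the extra anchor ratios suggested in the last comment, they would go in the anchor generator section of the SSD pipeline config. A sketch, assuming aspect ratios are expressed as width/height (so 4.0 corresponds to 4:1 and 0.25 to 1:4); the numeric values are illustrative, not taken from the question's config:

anchor_generator {
  ssd_anchor_generator {
    num_layers: 6
    min_scale: 0.2
    max_scale: 0.95
    aspect_ratios: 1.0
    aspect_ratios: 2.0
    aspect_ratios: 0.5
    aspect_ratios: 4.0   # extra 4:1 anchors (illustrative)
    aspect_ratios: 0.25  # extra 1:4 anchors (illustrative)
  }
}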
