SSD mobilenet model does not detect objects at longer distances

I have trained an SSD MobileNet model on a custom dataset (battery). A sample image of the battery is given below, and I have also attached the config file I used to train the model.

Sample Battery image

When the object is close to the camera (tested with a webcam), it is detected accurately with a probability above 0.95, but when I move the object further away it is no longer detected. Upon debugging, I found that the object does get detected, but with a lower probability of 0.35. The minimum threshold is set to 0.5. If I lower the threshold from 0.5 to 0.2, the object is detected, but there are more false detections.

According to this link, SSD does not perform very well on small objects, and an alternative solution is to use Faster R-CNN, but that model is very slow in real time. I would like the battery to be detected from a longer distance using SSD as well.

Please help me with the following:

  1. If we want to detect objects at longer distances with higher probability, do we need to change the aspect ratio and scale params in the config?
  2. If we do change the aspect ratios, how do we choose those values with respect to the object?
Botelho asked 10/5, 2019 at 6:10

Changing aspect ratios and scales won't help improve the detection accuracy for small objects (since the original scale is already small enough, e.g. min_scale = 0.2). The most important parameter you need to change is feature_map_layout. feature_map_layout determines the number of feature maps (and their sizes) and their corresponding depths (channels). Sadly, this parameter cannot be configured in the pipeline_config file; you have to modify it directly in the feature extractor.
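For reference, these are the parameters in question. In the stock ssd_mobilenet_v1 sample config they look roughly like this (the values below are the usual defaults, not necessarily your exact config):

anchor_generator {
  ssd_anchor_generator {
    num_layers: 6
    min_scale: 0.2
    max_scale: 0.95
    aspect_ratios: 1.0
    aspect_ratios: 2.0
    aspect_ratios: 0.5
    aspect_ratios: 3.0
    aspect_ratios: 0.3333
  }
}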

Here is why feature_map_layout is important for detecting small objects:

[Figure: a ground-truth image with its matched default boxes shown on an 8x8 and a 4x4 feature map]

In the figure above, (b) and (c) are two feature maps of different layouts. The dog in the ground-truth image matches the red anchor box on the 4x4 feature map, while the cat matches the blue one on the 8x8 feature map. Now, if the object you want to detect is the cat's ear, there is no anchor box that matches it, and the intuition is: if no anchor box matches an object, the object simply won't be detected. To successfully detect the cat's ear, what you probably need is a 16x16 feature map.
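A rough back-of-the-envelope check makes this concrete. This is a minimal sketch, assuming a 300x300 input and the default scale range shown above (min_scale = 0.2, max_scale = 0.95, 6 layers), not the exact anchor generator code:

# Per-layer SSD scales are spaced linearly between min_scale and max_scale,
# so the smallest square anchor on a 300x300 input is already about 60 px wide.
# A distant object that only spans, say, 20-30 px has no anchor of a
# comparable size to match against.
input_size = 300
min_scale, max_scale, num_layers = 0.2, 0.95, 6

scales = [min_scale + (max_scale - min_scale) * k / (num_layers - 1)
          for k in range(num_layers)]
anchor_sizes = [round(s * input_size) for s in scales]
print(anchor_sizes)  # [60, 105, 150, 195, 240, 285]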

Here is how you can make the change to feature_map_layout. This parameter is configured in each specific feature extractor implementation. If you use ssd_mobilenet_v1_feature_extractor, you can find it in this file.

feature_map_layout = {
    'from_layer': ['Conv2d_11_pointwise', 'Conv2d_13_pointwise', '', '',
                   '', ''],
    'layer_depth': [-1, -1, 512, 256, 256, 128],
    'use_explicit_padding': self._use_explicit_padding,
    'use_depthwise': self._use_depthwise,
}

Here there are 6 feature maps of different scales. The first two are taken directly from MobileNet layers (hence their layer_depth is -1), while the remaining four result from extra convolutional operations. You can see that the lowest-level feature map comes from the layer Conv2d_11_pointwise of MobileNet. Generally, the lower the layer, the finer its feature map, and the better it is for detecting small objects. So you can change Conv2d_11_pointwise to Conv2d_5_pointwise (why this one? The TensorFlow graph shows that this layer has a bigger feature map than Conv2d_11_pointwise); it should help detect smaller objects.
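Concretely, the swapped layout would look like this (only the first from_layer entry changes; everything else stays as in the original):

feature_map_layout = {
    # take the finer Conv2d_5_pointwise map instead of Conv2d_11_pointwise
    'from_layer': ['Conv2d_5_pointwise', 'Conv2d_13_pointwise', '', '',
                   '', ''],
    'layer_depth': [-1, -1, 512, 256, 256, 128],
    'use_explicit_padding': self._use_explicit_padding,
    'use_depthwise': self._use_depthwise,
}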

But better accuracy comes at an extra cost: detection speed will drop a little because there are more anchor boxes to take care of (bigger feature maps). Also, since we chose Conv2d_5_pointwise over Conv2d_11_pointwise, we lose the detection power of Conv2d_11_pointwise.

If you don't want to replace the layer but simply want to add an extra feature map, e.g. making it 7 feature maps in total, you will also have to change num_layers in the config file to 7. You can think of this parameter as the resolution of the detection network: the more low-level layers you include, the finer the resolution.
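A sketch of this 7-map variant, assuming you add Conv2d_5_pointwise in front of the existing entries; the num_layers parameter under ssd_anchor_generator in the pipeline config must also be 7 so that the number of feature maps matches num_anchors_per_location. Treat this as a starting point rather than a verified recipe:

feature_map_layout = {
    # 7 entries instead of 6: Conv2d_5_pointwise is added as the finest map
    'from_layer': ['Conv2d_5_pointwise', 'Conv2d_11_pointwise',
                   'Conv2d_13_pointwise', '', '', '', ''],
    'layer_depth': [-1, -1, -1, 512, 256, 256, 128],
    'use_explicit_padding': self._use_explicit_padding,
    'use_depthwise': self._use_depthwise,
}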

Now, if you have performed the above operations, one more thing that will help is to add more training images containing small objects. If this is not feasible, you can at least try adding data augmentation operations such as random_image_scale.
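For example, in the train_config section of the pipeline config (field names as in the Object Detection API's preprocessor.proto; the ratio values here are only illustrative):

data_augmentation_options {
  random_image_scale {
    min_scale_ratio: 0.5
    max_scale_ratio: 2.0
  }
}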

Erin answered 10/5, 2019 at 11:34 (4 comments)
@danyfang Thanks for your reply. If I make the changes in the feature map layout, will I be able to detect the cat at a longer distance? Because I don't want to detect the cat's ear (as you mentioned). – Botelho
I just wanted to show an extreme example: a cat at a longer distance is about the size of an ear of a cat at a closer distance. – Erin
@danyfang Do we need to change min_scale and max_scale when changing the layer name? Do the scale values have any impact when changing the layer name? – Botelho
@danyfang Thank you for your answer; I had a similar problem to the OP. However, I tried to follow the procedure you described and it doesn't work. I added Conv2d_5_pointwise in ssd_mobilenet_v1_feature_extractor.py (and the corresponding -1 entry in layer_depth), and changed num_layers in the config file to 7 too, but I still get an error stating "ValueError: Number of feature maps is expected to equal the length of num_anchors_per_location". Should I modify anything else to make it work? – Gann
