What is an object detection "head"?

I am currently reading up on SSD (Single Shot Detector) and there is a term that I am struggling to understand: "head". When I hear this word, I think of the head of the network, as in the beginning.

I looked at the Object Detection API created by Google and found the "heads" folder with different head types, one for the box encoding and another for the class predictions.

The documentation for the abstract "head" class was not super enlightening:

"All the different kinds of prediction heads in different models will inherit from this class. What is in common between all head classes is that they have a predict function that receives features as its first argument."

I guess I understand them on a high level, but I don't have a concrete definition of them. Can someone define a "head" and explain how one can have a "box prediction head" or a "classification head"?

Newt answered 26/10, 2018 at 16:27 Comment(0)

In some domains, "head" means the start or beginning of something. In this domain it's different. In many computer vision tasks you use a "backbone", a network that is usually pre-trained on ImageNet. The backbone acts as a feature extractor and gives you a feature map representation of the input. Once you have that feature map, you still need to perform the actual task, such as detection or segmentation. You usually do that by applying a "detection head" to the feature map(s), so it's like a head attached to the backbone.
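
As a rough illustration of that split, here is a toy sketch in plain Python. Everything in it is hypothetical (the names `backbone`, `LinearHead` and `detect`, the 2x2 average pooling, the per-cell linear maps); it is not how the Object Detection API or any real detector is implemented. It only shows the shape of the idea: features are computed once, and each head is a small predictor with a `predict(features)` method that turns those shared features into one kind of output.

```python
# Toy sketch: all names here are made up for illustration.

def backbone(image):
    """Stand-in feature extractor: 2x2 average pooling over a 2D grid."""
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


class LinearHead:
    """Stand-in head: maps every feature-map cell to a list of raw outputs."""

    def __init__(self, params):
        # One (weight, bias) pair per output value produced for each cell.
        self.params = params

    def predict(self, features):
        return [
            [[w * cell + b for (w, b) in self.params] for cell in row]
            for row in features
        ]


def detect(image, heads):
    """Run the backbone once, then apply every head to the same features."""
    features = backbone(image)
    return {name: head.predict(features) for name, head in heads.items()}


image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
heads = {
    # Three raw class scores per cell (arbitrary toy weights).
    "class_scores": LinearHead([(1.0, 0.0), (-1.0, 0.0), (0.5, 1.0)]),
    # Four raw box numbers per cell (arbitrary toy weights).
    "box_offsets": LinearHead([(0.1, 0.0)] * 4),
}
outputs = detect(image, heads)
```

In a real detector the backbone is a deep convolutional network and the heads are small convolutional stacks, but the relationship between them is the same: swapping or adding a head changes the task without touching the feature extractor.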

In the case of object detection, you need two output types: classification confidences and bounding boxes. They can come from two different, decoupled heads (e.g. RetinaNet) or from a single head that computes both outputs (e.g. SSD). In both cases, you need to specify exactly how the output is interpreted. For example, are the bounding box regression outputs relative to an anchor, or relative to the entire image? Do you apply softmax to the classification outputs to obtain the confidences? And so on.
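
To make the interpretation question concrete, here is a minimal sketch of one common convention, stated as an assumption rather than as what any particular model does: boxes are stored as (center x, center y, width, height), the box head predicts offsets relative to an anchor, and class confidences come from a softmax over raw scores. The function names `decode_box` and `softmax` are made up for this example, and the extra offset scaling factors that some implementations apply are left out.

```python
import math


def decode_box(anchor, offsets):
    """Convert anchor-relative offsets into an actual box.

    anchor  = (cx, cy, w, h) of the anchor box
    offsets = (tx, ty, tw, th), the raw numbers from the box head
    """
    acx, acy, aw, ah = anchor
    tx, ty, tw, th = offsets
    cx = acx + tx * aw        # center shift is measured in anchor widths
    cy = acy + ty * ah        # and anchor heights
    w = aw * math.exp(tw)     # size change is predicted in log space
    h = ah * math.exp(th)
    return (cx, cy, w, h)


def softmax(logits):
    """Turn raw class scores into confidences that sum to 1."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


anchor = (0.5, 0.5, 0.2, 0.4)

# All-zero offsets mean "the object is exactly the anchor".
same_box = decode_box(anchor, (0.0, 0.0, 0.0, 0.0))

# Shift right by one anchor width and double the width.
moved_box = decode_box(anchor, (1.0, 0.0, math.log(2.0), 0.0))

confidences = softmax([2.0, 0.5, 0.1])
```

The same raw numbers would mean something else entirely under a different convention (for example, corner coordinates relative to the whole image, or independent sigmoids per class), which is why the head's output format has to be documented alongside the head itself.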

Clayton answered 28/10, 2018 at 8:27 Comment(2)
It seems that between backbone and head, there is also a part called "neck". Would you please elaborate a bit on that also? Or point to the origin of the terms "backbone/neck/head". Thanks! - Phenol
Very briefly, the model is structured this way: input -> backbone -> neck -> head -> output. For an object detection model, the backbone extracts the features (it is usually a portion of a network used for image classification), the neck extracts more elaborate features (e.g. see the feature pyramid network), and the head computes the output. - Centrosymmetric
