Instead of "region detection + object classification", its "(1)region proposal + (2)classification and localization in two stage detectors.
(1-region proposal) is done by what is called a Region Proposal Network (RPN, for short). RPN is used to decide “where” to look in order to reduce the computational requirements of the overall inference process. The RPN quickly and efficiently scans every location in order to assess whether further processing needs to be carried out in a given region. It does that by outputting k bounding box proposals each with 2 scores representing probability of object or not at each location. In other words, it is used to find up to a predefined number(~2000) of regions (bounding boxes), which may contain objects.
An important problem within object detection is generating a variable-length list of bounding boxes. The variable-length problem is solved in the RPN by using anchors: fixed sized reference bounding boxes which are placed uniformly throughout the original image. Instead of having to detect where objects are, we model the problem into two parts. For every anchor, we ask:
- Does this anchor contain a relevant object?
- How would we adjust this anchor to better fit the relevant object?
After having a list of possible relevant objects and their locations in the original image, it becomes a more straightforward problem to solve. Using the features extracted by the CNN and the bounding boxes with relevant objects, we apply Region of Interest (RoI) Pooling and extract those features which would correspond to the relevant objects into a new tensor.
Next in second stage, R-CNN module uses above information to:
- Classify the content in the bounding box (or discard it, using
“background” as a label).
- Adjust the bounding box coordinates (so it better fits the object).