Using ROIPooling layer with a pretrained ResNet34 model in MxNet-Gluon

Assume I have a pretrained ResNet34 model in MXNet and I want to add to it the premade ROIPooling layer included in the API:

https://mxnet.incubator.apache.org/api/python/ndarray/ndarray.html#mxnet.ndarray.ROIPooling

If the code for initializing ResNet is the following, how can I add ROIPooling after the last layer of the ResNet features, before the classifier?

More generally, how can I use the ROIPooling function in my model?

How can I pass multiple different ROIs to the ROIPooling layer? How should they be stored? How should the data iterator be changed so that it gives me the batch index required by the ROIPooling function?

Let us assume that I use this with the VOC 2012 dataset for the task of action recognition.

batch_size = 40
num_classes = 11
init_lr = 0.001
step_epochs = [2]

train_iter, val_iter, num_samples = get_iterators(batch_size,num_classes)
resnet34 = vision.resnet34_v2(pretrained=True, ctx=ctx)

net = vision.resnet34_v2(classes=num_classes)

class ROIPOOLING(gluon.HybridBlock):
    def __init__(self):
        super(ROIPOOLING, self).__init__()

    def hybrid_forward(self, F, x):
        #print(x)
        a = mx.nd.array([[0, 0, 0, 7, 7]]).tile((40,1))
        return F.ROIPooling(x, a, (2,2), 1.0)

net_cl = nn.HybridSequential(prefix='resnetv20')
with net_cl.name_scope():
    for l in xrange(4):
        net_cl.add(resnet34.classifier._children[l])
    net_cl.add(nn.Dense(num_classes,  in_units=resnet34.classifier._children[-1]._in_units))

net.classifier = net_cl
net.classifier[-1].collect_params().initialize(mx.init.Xavier(rnd_type='gaussian', factor_type="in", magnitude=2), ctx=ctx)

net.features = resnet34.features
net.features._children.append(ROIPOOLING())

net.collect_params().reset_ctx(ctx)

The ROIPooling layer is typically used in object detection networks such as R-CNN and its variants (Fast R-CNN and Faster R-CNN). The essential part of all these architectures is a component (neural or classical CV) that generates region proposals. These region proposals are the ROIs that need to be fed into the ROIPooling layer. The output of the ROIPooling layer is a batch of tensors, where each tensor represents one cropped area of an image, pooled to a fixed size. Each of these tensors is processed independently for classification. For example, in R-CNN, these tensors are RGB crops of the image, which are then run through a classification network. In Fast R-CNN and Faster R-CNN, the tensors are crops of the features produced by a convolutional network, for example ResNet34.
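To make the "one fixed-size tensor per ROI" idea concrete, here is a minimal pure-Python sketch of ROI max pooling on a single-channel feature map. The function name `roi_max_pool` and the binning rule are illustrative assumptions, not MXNet's implementation; the real operator also handles batches, channels, `spatial_scale` and its own rounding.

```python
import math


def roi_max_pool(feature, roi, pooled_size):
    """Max-pool one ROI of a 2-D feature map (a list of rows) to a fixed grid.

    roi is (x1, y1, x2, y2) in feature-map coordinates, both corners inclusive.
    Hypothetical helper for illustration only.
    """
    x1, y1, x2, y2 = roi
    ph, pw = pooled_size
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    out = []
    for i in range(ph):
        # Floor the start and ceil the end so every bin is non-empty
        r0 = y1 + math.floor(i * roi_h / ph)
        r1 = y1 + math.ceil((i + 1) * roi_h / ph)
        row = []
        for j in range(pw):
            c0 = x1 + math.floor(j * roi_w / pw)
            c1 = x1 + math.ceil((j + 1) * roi_w / pw)
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        out.append(row)
    return out


# An 8x8 feature map whose value at (row, col) is row * 8 + col
feature = [[r * 8 + c for c in range(8)] for r in range(8)]
# Two ROIs of different sizes both come out as 2x2 grids
pooled = [roi_max_pool(feature, roi, (2, 2))
          for roi in [(0, 0, 7, 7), (2, 2, 5, 4)]]
print(pooled)
```

Whatever the size of the input region, each ROI yields a grid of the same shape, which is what lets a fixed-size classifier run on every ROI.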

In your example, whether through a classic computer vision algorithm (as in R-CNN and Fast R-CNN) or using a Region Proposal Network (as in Faster R-CNN), you need to generate some ROIs that are candidates for containing an object of interest. Once you have these ROIs for each image in one mini-batch, you then need to combine them into one NDArray in which each row is [batch_index, x1, y1, x2, y2]. This layout means that you can have as many ROIs as you want, and for each ROI you must specify which image in the batch to crop (hence the batch_index) and what coordinates to crop it at (hence (x1, y1) for the top-left corner and (x2, y2) for the bottom-right corner).
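As a rough sketch of how per-image proposals could be combined into that layout, the helper below flattens one list of boxes per image into rows of [batch_index, x1, y1, x2, y2]. The name `build_rois` and the input format are assumptions made for illustration; your data iterator or proposal module may hand you the boxes in a different structure.

```python
def build_rois(boxes_per_image):
    """Flatten per-image box lists into rows of [batch_index, x1, y1, x2, y2].

    boxes_per_image[i] holds the (x1, y1, x2, y2) boxes for image i of the
    mini-batch. Hypothetical helper, not part of MXNet.
    """
    rois = []
    for batch_index, boxes in enumerate(boxes_per_image):
        for x1, y1, x2, y2 in boxes:
            rois.append([float(batch_index),
                         float(x1), float(y1), float(x2), float(y2)])
    return rois


# Two proposals for the first image, one for the second
boxes_per_image = [
    [(10, 10, 100, 100), (20, 20, 120, 120)],
    [(15, 15, 110, 110)],
]
rois = build_rois(boxes_per_image)
print(rois)
```

The resulting list can then be wrapped with nd.array(rois) before being passed to the network. Because the batch index is just the position of the image within the mini-batch, the iterator itself only needs to return the boxes alongside each image.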

So based on the above, if you're implementing something similar to R-CNN, you would pass your images directly into the ROIPooling layer:

import mxnet as mx
from mxnet import gluon, nd


class ClassifyObjects(gluon.HybridBlock):
    def __init__(self, num_classes, pooled_size):
        super(ClassifyObjects, self).__init__()
        self.classifier = gluon.model_zoo.vision.resnet34_v2(classes=num_classes)
        self.pooled_size = pooled_size

    def hybrid_forward(self, F, imgs, rois):
        return self.classifier(
            F.ROIPooling(
                imgs, rois, pooled_size=self.pooled_size, spatial_scale=1.0))


# num_classes are 10 categories plus 1 class for "no-object-in-this-box" category
net = ClassifyObjects(num_classes=11, pooled_size=(64, 64))
# Initialize parameters and overload pre-trained weights
net.collect_params().initialize()
pretrained_net = gluon.model_zoo.vision.resnet34_v2(pretrained=True)
net.classifier.features = pretrained_net.features

If we now send dummy data through the network, you can see that when the rois array contains 4 ROIs, the output contains 4 classification results:

# Dummy forward pass through the network
imgs = nd.random.uniform(shape=(2, 3, 128, 128))  # shape is (batch_size, channels, height, width)
rois = nd.array([[0, 10, 10, 100, 100], [0, 20, 20, 120, 120],
                 [1, 15, 15, 110, 110], [1, 25, 25, 128, 128]])
out = net(imgs, rois)
print(out.shape)

Outputs:

(4, 11)

If, however, you want to use ROIPooling in a model similar to Fast R-CNN or Faster R-CNN, you need access to the features of the network before they are average pooled. These features are then ROIPooled before being passed on to classification. Here is an example where the features come from the pre-trained network, the ROIPooling's pooled_size is 4x4, and a simple GlobalAveragePooling followed by a Dense layer is used for classification after ROIPooling. Note that because the image is downsampled by a factor of 32 as it passes through the ResNet network, spatial_scale is set to 1.0/32 so that the ROIPooling layer automatically scales the ROI coordinates, which are given in image pixels, to the feature map.

def GetResnetFeatures(resnet):
    resnet.features._children.pop()  # Pop Flatten layer
    resnet.features._children.pop()  # Pop GlobalAveragePooling layer
    return resnet.features


class ClassifyObjects(gluon.HybridBlock):
    def __init__(self, num_classes, pooled_size):
        super(ClassifyObjects, self).__init__()
        # Add a placeholder for features block
        self.features = gluon.nn.HybridSequential()
        # Add a classifier block
        self.classifier = gluon.nn.HybridSequential()
        self.classifier.add(gluon.nn.GlobalAvgPool2D())
        self.classifier.add(gluon.nn.Flatten())
        self.classifier.add(gluon.nn.Dense(num_classes))
        self.pooled_size = pooled_size

    def hybrid_forward(self, F, imgs, rois):
        features = self.features(imgs)
        return self.classifier(
            F.ROIPooling(
                features, rois, pooled_size=self.pooled_size, spatial_scale=1.0/32))


# num_classes are 10 categories plus 1 class for "no-object-in-this-box" category
net = ClassifyObjects(num_classes=11, pooled_size=(4, 4))
# Initialize parameters and overload pre-trained weights
net.collect_params().initialize()
net.features = GetResnetFeatures(gluon.model_zoo.vision.resnet34_v2(pretrained=True))
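To see what spatial_scale=1.0/32 does to an ROI, the arithmetic can be sketched in plain Python. The helper `to_feature_coords` is hypothetical, and the sketch only shows the scaling step; how the operator rounds the scaled coordinates afterwards is not modelled here.

```python
def to_feature_coords(roi, spatial_scale):
    """Scale the box part of [batch_index, x1, y1, x2, y2].

    The batch index is left unchanged. Hypothetical helper for illustration.
    """
    batch_index, x1, y1, x2, y2 = roi
    return [batch_index] + [c * spatial_scale for c in (x1, y1, x2, y2)]


image_size = 128
stride = 32  # total downsampling of the ResNet features, as stated above
feature_size = image_size // stride  # side length of the feature map
scaled = to_feature_coords([0, 10, 10, 100, 100], 1.0 / stride)
print(feature_size, scaled)
```

With 128x128 inputs the feature map is only 4x4, so an ROI given as (10, 10, 100, 100) in pixels covers roughly the region from 0.3 to 3.1 in feature-map units.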

If we now send dummy data through the network, you can see that when the rois array contains 4 ROIs, the output contains 4 classification results:

# Dummy forward pass through the network
# shape of the image batch is (batch_size, channels, height, width)
imgs = nd.random.uniform(shape=(2, 3, 128, 128))
# rois is the output of region proposal module of your architecture
# Each ROI entry contains [batch_index, x1, y1, x2, y2]
rois = nd.array([[0, 10, 10, 100, 100], [0, 20, 20, 120, 120],
                 [1, 15, 15, 110, 110], [1, 25, 25, 128, 128]])
out = net(imgs, rois)
print(out.shape)

Outputs:

(4, 11)
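Since each output row corresponds to one ROI rather than one image, you will usually want to map the results back to their images using the batch index. The sketch below assumes the predicted class per ROI has already been extracted as a plain list (for instance from out.argmax(axis=1)); the class values used here are made up and `group_by_image` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_image(rois, predictions):
    """Group (box, prediction) pairs by the batch index stored in each ROI."""
    grouped = defaultdict(list)
    for roi, pred in zip(rois, predictions):
        batch_index = int(roi[0])
        grouped[batch_index].append((tuple(roi[1:]), pred))
    return dict(grouped)


rois = [[0, 10, 10, 100, 100], [0, 20, 20, 120, 120],
        [1, 15, 15, 110, 110], [1, 25, 25, 128, 128]]
predicted_classes = [3, 3, 7, 0]  # made-up labels, one per ROI
grouped = group_by_image(rois, predicted_classes)
for batch_index, results in sorted(grouped.items()):
    print(batch_index, results)
```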
