Why are rotation-invariant neural networks not used by the winners of the popular competitions?

As is known, the most popular modern CNNs (convolutional neural networks), such as VGG/ResNet (Faster R-CNN), SSD, YOLO, YOLOv2, DenseBox, and DetectNet, are not rotation invariant: Are modern CNN (convolutional neural network) as DetectNet rotate invariant?

It is also known that there are several neural networks with rotation-invariant object detection:

  1. Rotation-Invariant Neoperceptron 2006 (PDF): https://www.researchgate.net/publication/224649475_Rotation-Invariant_Neoperceptron

  2. Learning rotation invariant convolutional filters for texture classification 2016 (PDF): https://arxiv.org/abs/1604.06720

  3. RIFD-CNN: Rotation-Invariant and Fisher Discriminative Convolutional Neural Networks for Object Detection 2016 (PDF): http://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Cheng_RIFD-CNN_Rotation-Invariant_and_CVPR_2016_paper.html

  4. Encoded Invariance in Convolutional Neural Networks 2014 (PDF)

  5. Rotation-invariant convolutional neural networks for galaxy morphology prediction 2015 (PDF): https://arxiv.org/abs/1503.07077

  6. Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images 2016: http://ieeexplore.ieee.org/document/7560644/

We know that in image-detection competitions such as ImageNet, MS COCO, and PASCAL VOC, the winners use network ensembles (several neural networks at once), or ensembles within a single network, such as ResNet (Residual Networks Behave Like Ensembles of Relatively Shallow Networks).

But do the winners, such as MSRA, use rotation-invariant network ensembles, and if not, why not? Why does adding a rotation-invariant network to an ensemble not improve accuracy for detecting certain objects, such as aircraft, whose images are taken at many different rotation angles?

It can be:

  • aircraft which are photographed from the ground

  • or ground objects which are photographed from the air

Why are rotation-invariant neural networks not used by the winners of the popular object-detection competitions?

Gastrula answered 9/12, 2016 at 22:31 Comment(4)
In many competitions people analyze every class and its possible rotations. A picture of a plane in the sky can have any possible rotation, but a horizontal picture of a running dog cannot. So they generate new training images from the original ones at every possible rotation. Maybe that is more accurate than a rotation-invariant algorithm. Another possible explanation is that there are very efficient libraries for running CNNs on GPUs (I don't know whether equally efficient GPU libraries exist for rotation-invariant neural nets).Semiannual
@Semiannual 1. Yes, the rotation-invariant approach can be used only for affine transformations (to detect aerial objects from the ground, or ground objects from the air), not for elastic transformations (to detect animals), and not for rotations about an axis outside the shooting plane. But a rotation-invariant CNN can be used in addition to an ordinary convolutional network in an ensemble. A rotation-invariant CNN requires far fewer input images and tunable parameters, and thus learns faster and more accurately (for the objects it suits best).Gastrula
@Semiannual 2. About GPUs. From 5. Rotation-invariant convolutional neural networks for galaxy morphology prediction: "7.9 Implementation ... This allowed the use of GPU acceleration without any additional effort ... Networks were trained on NVIDIA GeForce GTX 680 cards." arxiv.org/pdf/1503.07077v1.pdf Also, the rotation-invariant cv::SURF_GPU might in some way be usable instead of a convolution kernel (matrix).Gastrula
In fact, rotation invariance is very useful for object detection in aerial images. For example, see the new RoI Transformer algorithm (arxiv.org/abs/1812.00155) on DOTA.Purely

The recent progress in image recognition, which was made mainly by moving from the classic feature selection + shallow learning approach to the no feature selection + deep learning approach, wasn't caused only by the mathematical properties of convolutional neural networks. Yes, their ability to capture the same information using a smaller number of parameters is partly due to their shift-invariance property, but recent research has shown that this is not the key to understanding their success.

In my opinion, the main reason behind this success was the development of faster learning algorithms rather than more mathematically accurate ones, and that's why less attention is paid to building other invariance properties into neural nets.

Of course, rotation invariance is not skipped altogether. It is partially achieved by data augmentation, where you add slightly changed (e.g. rotated or rescaled) copies of an image to your dataset with the same label. As we can read in this fantastic book (Bishop, Pattern Recognition and Machine Learning, Chapter 5.5.3, titled "Invariances"), these two approaches (more structure vs. less structure + data augmentation) are more or less equivalent.
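
For instance, here is a minimal sketch of the augmentation route, assuming PyTorch/torchvision (the ±30-degree range is an arbitrary choice, not something the book prescribes):

```python
# Minimal sketch of the "less structure + data augmentation" approach.
# Assumes PyTorch/torchvision; the +/-30 degree range is arbitrary.
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

train_transform = T.Compose([
    T.RandomRotation(degrees=30),  # each epoch sees a differently rotated copy
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# The label stays the same: a rotated cat is still a cat.
train_set = CIFAR10(root="./data", train=True, download=True,
                    transform=train_transform)
```

The network then spends capacity learning that rotated copies share a label, which is exactly the trade-off against building the invariance into the architecture.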

Danie answered 10/12, 2016 at 13:32 Comment(2)
Yes, I think rotation-invariant convolutional kernels cannot yet be trained as fast as conventional kernels. However, a rotation-invariant kernel requires fewer parameters to learn (1 rotation-invariant kernel instead of 12 different ordinary kernels, one per 30-degree angle) and fewer input images. This should speed up training.Gastrula
Could you be more specific (e.g. a page number) about where Bishop states that the two approaches are more or less equivalent? I searched the book for "augment" but was unable to find anything.Sanbenito
J
6

I'm also wondering why the community and scholars haven't paid much attention to rotation-invariant CNNs, as @Alex asks.

One possible cause, in my opinion, is that many scenarios don't need this property, especially in those popular competitions. As Rob mentioned, some natural pictures are already taken in a unified horizontal (or vertical) orientation. For example, in face detection, many works align the picture to ensure the people are upright before feeding it to any CNN model. Honestly, this is the cheapest and most efficient way to handle this particular task.

However, there do exist some real-life scenarios that need the rotation-invariance property. So I come to another guess: this problem is not difficult from the experts' (or researchers') point of view. At the very least, we can use data augmentation to obtain some rotation invariance.

Lastly, thanks so much for your summary of the papers. I will add one more paper, Group Equivariant Convolutional Networks (G-CNN, ICML 2016), and its implementation on GitHub by other people.
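
For intuition, here is a minimal sketch of the core G-CNN idea restricted to the four 90-degree rotations (my own illustration in PyTorch, not the paper's code): apply one shared kernel in all four orientations and pool over the orientation axis.

```python
import torch
import torch.nn.functional as F

def p4_conv(x, weight):
    """Convolve x with all four 90-degree rotations of one shared kernel
    and max-pool over the orientation axis.
    x: (N, C, H, W), weight: (K, C, h, w)."""
    responses = [
        F.conv2d(x, torch.rot90(weight, k, dims=(2, 3)), padding="same")
        for k in range(4)  # 0, 90, 180, 270 degrees
    ]
    # (N, 4, K, H, W) -> max over orientations: the output map is
    # equivariant to 90-degree input rotations; a global spatial pool
    # on top would make the representation invariant.
    return torch.stack(responses, dim=1).max(dim=1).values

x = torch.randn(1, 3, 32, 32)
w = torch.randn(8, 3, 3, 3)   # one kernel set shared across 4 orientations
y = p4_conv(x, w)             # (1, 8, 32, 32)
```

Note how one kernel set covers four orientations, which is the parameter saving mentioned in the comments above.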

Jonjona answered 7/8, 2017 at 3:14 Comment(0)

Object detection is mostly driven by the successes of detection algorithms on world-famous benchmarks like PASCAL VOC and MS COCO. These are object-centric datasets in which most objects are upright (potted plants, humans, horses, etc.), so data augmentation with left-right flips is often sufficient (for all we know, augmenting with rotated images, such as upside-down flips, could even hurt detection performance).
Every year the entire community adopts the base algorithmic structure of the winning solution and builds on it (I am exaggerating a bit to make a point, but not by much).

Interestingly, other less widely known topics like oriented text detection and oriented vehicle detection in aerial imagery both need rotation-invariant features and rotation-equivariant detection pipelines (as in both articles from Cheng that you mentioned).

If you want to find literature and code in this area, you need to dive into these two domains. I can already give you a few pointers, like the DOTA challenge for aerial imagery or the ICDAR challenges for oriented text detection.

As @Marcin Mozejko said, CNNs are by nature translation invariant, not rotation invariant. How to incorporate perfect rotation invariance is an open problem; the few articles that deal with it have yet to become standards, even though some of them seem promising. My personal favorite for detection is the modification of Faster R-CNN recently proposed by Ma.

I hope that this direction of research will be investigated more and more once people get fed up with MS-COCO and VOC.

What you could try is to take a state-of-the-art detector trained on MS COCO, like Faster R-CNN with NASNet from the TF detection API, and see how it performs when you rotate the test image; in my opinion, it would be far from rotation invariant.
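
A rough sketch of that experiment; I substitute torchvision's COCO-trained Faster R-CNN for the TF detection API, and "test.jpg" is a placeholder path:

```python
# Rough sketch: probe a COCO-trained detector's sensitivity to rotation.
# Assumes torchvision instead of the TF detection API; "test.jpg" is a
# placeholder path.
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype, rotate

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
img = convert_image_dtype(read_image("test.jpg"), torch.float)

with torch.no_grad():
    for angle in (0, 30, 60, 90):
        pred = model([rotate(img, angle)])[0]
        kept = int((pred["scores"] > 0.5).sum())
        print(f"rotation {angle:>2} deg: {kept} detections above score 0.5")
```

If the detector were rotation invariant, the detection counts and scores would stay roughly constant across angles; I would expect them to drop sharply instead.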

Lim answered 5/7, 2018 at 14:10 Comment(0)

Rotation invariance is mostly a good thing, but not always. Objects can have different interpretations depending on their rotation; e.g., a rotated "1" might be difficult to distinguish from a "7".

Priapism answered 7/9, 2021 at 18:51 Comment(0)

First, let's acknowledge that introducing rotational invariance requires a static assumption about the distribution of angles. For example, another commenter on this page suggested rotating the kernel in 30-degree steps. That is equivalent to assuming that the useful rotations in each layer are uniformly distributed over rotation angles.
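
To make that concrete, here is a small illustrative sketch (my own, not from any cited paper) that expands one base kernel into a bank of copies at fixed 30-degree steps; the angle grid is chosen a priori rather than learned:

```python
# Illustrative only: expand one base kernel into 12 copies at fixed
# 30-degree steps -- the "static assumption" about the angle distribution.
import numpy as np
from scipy.ndimage import rotate

def rotated_kernel_bank(kernel, step_deg=30):
    angles = np.arange(0, 360, step_deg)           # 0, 30, ..., 330
    return np.stack([
        rotate(kernel, a, reshape=False, order=1)  # bilinear, same shape
        for a in angles
    ])

base = np.random.randn(5, 5)
bank = rotated_kernel_bank(base)  # shape (12, 5, 5): the angle grid is
                                  # fixed up front, not learned per layer
```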

In contrast, when the network learns rotated kernels, it picks a different distribution of angles for each layer. An interesting research question is to find out what distribution of rotation angles is implied by the learned kernels. In any case, why would such learning flexibility be useful?

I suspect that the assumption of a uniform distribution might not be equally useful across all layers of a network. In the first few convolutional layers (edges and other basic shapes), it's likely true that the rotation angles are uniformly distributed. However, in the deep layers, this assumption might be less valid. If cars are almost always rotated within a small range of angles, then why waste compute and space on unlikely rotations?

However, the network won't learn the right distribution of angles if the training dataset is not sufficiently representative. Note that simply rotating an image (called data augmentation) is not the same as rotating an object relative to other objects in the same image. I suppose it comes down to your expectation of the difference between the training dataset and the unobserved dataset to which the network has to generalize.

Interestingly, the human visual cortex is not fully rotation-invariant at the scale of major face features. See https://en.wikipedia.org/wiki/Thatcher_effect.

Taxidermy answered 1/10, 2022 at 14:57 Comment(0)
