Learning Spherical Convolution for Fast Features from 360° Imagery
We propose a generic approach to transfer Convolutional Neural Networks (CNNs) trained on perspective images to 360° images. Our solution entails a new form of distillation across camera projection models. Compared to current practices for feature extraction on 360° images, spherical convolution improves efficiency by avoiding repeated perspective projections, and it improves accuracy by adapting its kernels to the distortions of the equirectangular projection.
Existing strategies for applying off-the-shelf CNNs to 360° images are problematic: convolving directly on the equirectangular projection suffers from its distortions, while repeatedly projecting to perspective views is computationally expensive.
Other works learn new CNNs directly on spherical data. However, they require annotated training data in spherical format and cannot exploit existing perspective-image datasets and models, even for the very same task.
We train the spherical convolutional network to reproduce the exact outputs of the source model on perspective projections of the viewing sphere, while taking the full equirectangular projection as input.
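To make this objective concrete, below is a minimal sketch of the cross-projection distillation loss in PyTorch. The names `sph_cnn` (the spherical network), `source_cnn` (the fixed, pretrained perspective model), and `project_to_perspective` (which would render the tangent perspective view at a given sphere location) are hypothetical placeholders for illustration, not the authors' actual code.

```python
# Minimal sketch of the cross-projection distillation objective.
# `sph_cnn`, `source_cnn`, and `project_to_perspective` are assumed
# placeholders; tensor shapes are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(equirect, locations, sph_cnn, source_cnn,
                      project_to_perspective):
    """equirect: (1, 3, H, W) equirectangular frame.
    locations: list of (row, col) sphere samples on the feature map."""
    sph_out = sph_cnn(equirect)                    # (1, C, H', W') features
    loss = equirect.new_zeros(())
    for row, col in locations:
        # Render the perspective view tangent to the sphere at (row, col)
        # and take the fixed source model's feature there as the target.
        persp = project_to_perspective(equirect, row, col)
        with torch.no_grad():                      # source model stays frozen
            target = source_cnn(persp)             # (C,) target feature
        loss = loss + F.mse_loss(sph_out[0, :, row, col], target)
    return loss / len(locations)
```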
Because the distortion in equirectangular projection is location dependent, we untie the kernel weights along the rows; each kernel learns to account for the distortion it encounters at its row.
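The sketch below illustrates row-untied convolution under two simplifying assumptions: the module name `RowUntiedConv` is ours, and every row shares one kernel size for brevity, whereas the actual method can also vary kernel shape with latitude.

```python
# Illustrative sketch of row-untied convolution: one kernel per output
# row, so each kernel can adapt to that row's distortion. This naive
# version convolves the full image per row purely for clarity; a real
# implementation would convolve each row band only once.
import torch
import torch.nn as nn

class RowUntiedConv(nn.Module):
    def __init__(self, in_ch, out_ch, n_rows, kernel_size=3):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size,
                      padding=kernel_size // 2)
            for _ in range(n_rows)
        ])

    def forward(self, x):                # x: (B, in_ch, n_rows, W)
        # Row r of the output comes from row r's own kernel.
        rows = [conv(x)[:, :, r:r + 1, :]
                for r, conv in enumerate(self.convs)]
        return torch.cat(rows, dim=2)    # (B, out_ch, n_rows, W)
```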
We propose a layer-wise training procedure to accelerate learning: by requiring the spherical convolutional network to reproduce all intermediate outputs of the source model, we decouple the layers from one another, so each can be trained separately.
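A hedged sketch of such layer-wise training follows; the data layout (`inputs[l]` and `targets[l]` as per-layer pairs extracted from the source model's intermediate feature maps) is our assumption for illustration.

```python
# Sketch of layer-wise training: layer l is trained to map the source
# model's layer (l-1) features to its layer l features, so no layer
# waits on the (still-changing) layers before it.
import torch
import torch.nn.functional as F

def train_layer_wise(sph_layers, inputs, targets, epochs=10, lr=1e-3):
    for l, layer in enumerate(sph_layers):
        opt = torch.optim.SGD(layer.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in zip(inputs[l], targets[l]):
                opt.zero_grad()
                # Match this layer's output to the source model's
                # corresponding intermediate output.
                loss = F.mse_loss(layer(x), y)
                loss.backward()
                opt.step()
```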
To evaluate the method, we apply SphConv to an off-the-shelf Faster R-CNN model. We train the model on the Pano2Vid 360° video dataset and evaluate on both Pano2Vid and a spherical version of PASCAL VOC 2007.