Pano2Vid: Automatic Cinematography for Watching 360° Videos
A 360° camera captures the entire visual world from its optical center, which provides exciting new ways to record and experience visual content by relieving restrictions on the field-of-view (FOV). Videographers no longer have to determine what to capture in the scene, and human viewers can freely explore the visual content. On the other hand, it also introduces new challenges for the viewer, who has to decide "where and what" to look at by controlling the viewing direction throughout the full duration of the video. Because the viewer has no information about the content beyond the current FOV, it may be difficult to find interesting content and determine where to look in real time.
To address this difficulty, we define “Pano2Vid”, a new computer vision problem. The task is to design an algorithm that automatically controls the pose and motion of a virtual normal-field-of-view (NFOV) camera within an input 360° video. Camera control must be optimized to produce video that could conceivably have been captured by a human observer equipped with a real NFOV camera. A successful Pano2Vid solution would therefore take the burden of choosing “where to look” off both the videographer and the end viewer: the videographer could enjoy the moment without consciously directing her camera, while the end viewer could watch intelligently-chosen portions of the video in the familiar NFOV format.
When watching 360° videos, the human viewer needs to actively control the viewing direction. This is not a trivial task, because the viewer has no information beyond the current field of view. For example, in the above video, the viewer fails to notice that an elephant is approaching the camera from the opposite direction at the beginning. The key challenge in watching 360° video is therefore how to find the right direction to watch.
To overcome this challenge, we propose a new computer vision problem whose solution would help people determine where and what to look at in 360° video.
A spatio-temporal glimpse is a short normal-field-of-view (NFOV) video clip extracted from a 360° video at a fixed viewing direction. It transforms 360° content into ordinary video, making its visual features directly comparable with those of NFOV videos.
We define capture-worthiness as how closely a spatio-temporal glimpse resembles human-captured normal-field-of-view videos ("HumanCam").
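One natural way to realize this definition is to train a discriminator that separates HumanCam clips from arbitrary 360° glimpses, and use its probability output as the score. The sketch below illustrates the idea with a tiny logistic-regression discriminator trained by gradient descent on synthetic stand-in features; the actual feature representation and classifier used by the system are not specified here, so every concrete choice (feature dimension, learning rate, data) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in features; a real system would extract video
# descriptors from each clip instead of sampling random vectors.
pos = rng.normal(0.5, 1.0, size=(200, 16))   # "HumanCam" clips
neg = rng.normal(-0.5, 1.0, size=(200, 16))  # arbitrary 360° glimpses

X = np.vstack([pos, neg])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Train a small logistic-regression discriminator by gradient descent.
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(HumanCam)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

def capture_worthiness(features):
    """Probability-like score: how HumanCam-like a glimpse looks."""
    return 1.0 / (1.0 + np.exp(-(np.atleast_2d(features) @ w + b)))
```

Scores lie in [0, 1], so they can be accumulated and compared across glimpses when selecting a camera trajectory.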
Given a 360° video, we densely sample spatio-temporal glimpses both spatially and temporally: 198 glimpses at 18 azimuthal angles and 11 polar angles every 5 seconds. We then estimate the capture-worthiness score for every glimpse.
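The sampling scheme above can be sketched as a simple grid enumeration. The helper below is an assumption about the grid layout (uniform azimuths over 360°, polar angles spanning 0°–180°); the text only specifies 18 × 11 directions every 5 seconds.

```python
import itertools

def glimpse_grid(duration_sec, n_azimuth=18, n_polar=11, step_sec=5):
    """Enumerate (time, azimuth, polar) triples for dense glimpse sampling.

    Angles are in degrees. The uniform spacing here is an illustrative
    assumption; the source only states 18 azimuthal x 11 polar angles
    sampled every 5 seconds.
    """
    times = range(0, duration_sec, step_sec)
    azimuths = [i * 360.0 / n_azimuth for i in range(n_azimuth)]
    polars = [i * 180.0 / (n_polar - 1) for i in range(n_polar)]
    return list(itertools.product(times, azimuths, polars))

grid = glimpse_grid(30)  # 6 time steps x 18 x 11 = 1188 candidates
```

Each triple identifies one candidate glimpse whose capture-worthiness is then scored.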
We transform the problem of controlling the viewing direction into selecting one spatio-temporal glimpse at each moment in the video. We find a path over the spatio-temporal glimpses that maximizes the accumulated capture-worthiness score while obeying a smooth camera motion constraint, which forbids the virtual camera from making abrupt motions. The problem reduces to a shortest-path problem and can be solved by dynamic programming.
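A minimal sketch of this dynamic program follows. The score table and the smoothness predicate are placeholders: in the real system the scores come from the capture-worthiness model and the motion constraint restricts transitions to nearby viewing directions, so `allowed_motion` below is an assumption standing in for that constraint.

```python
def best_glimpse_path(scores, allowed_motion):
    """Select one glimpse per time step maximizing total score.

    scores[t][g]  : capture-worthiness of candidate direction g at step t.
    allowed_motion(prev, nxt) -> bool : smooth-motion constraint
    (illustrative; the real constraint limits angular displacement).
    """
    T, G = len(scores), len(scores[0])
    NEG = float("-inf")
    best = [scores[0][:]] + [[NEG] * G for _ in range(T - 1)]
    back = [[0] * G for _ in range(T)]
    for t in range(1, T):
        for g in range(G):
            for p in range(G):
                if allowed_motion(p, g) and best[t - 1][p] + scores[t][g] > best[t][g]:
                    best[t][g] = best[t - 1][p] + scores[t][g]
                    back[t][g] = p
    # Trace back the highest-scoring smooth path.
    g = max(range(G), key=lambda i: best[T - 1][i])
    path = [g]
    for t in range(T - 1, 0, -1):
        g = back[t][g]
        path.append(g)
    return path[::-1]
```

For example, with three candidate directions and transitions limited to adjacent indices, the program trades a locally high score for a path that can reach a later, larger reward.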
We collect the 360° and HumanCam videos from YouTube using the following keywords: "Hiking", "Mountain Climbing", "Parade", "Soccer".
| | # Videos | Total Length |
| --- | --- | --- |
| 360° Videos | 86 | 7.3 hours |
| HumanCam | 9,171 | 343 hours |
These two examples show why the center and eye-level heuristics are reasonable baselines to compare against. Nevertheless, they cannot adapt to the content and often fail to achieve good framing.
The first example shows two problems with the AutoCam algorithm:
The second example shows that the capture-worthiness score does not encode preferences among different contents that all look like HumanCam videos.
Design highlights: