Pano2Vid: Automatic Cinematography for Watching 360° Videos

Concept figure

A 360° camera captures the entire visual world from its optical center, which provides exciting new ways to record and experience visual content by relieving restrictions on the field-of-view (FOV). Videographers no longer have to determine what to capture in the scene, and human viewers can freely explore the visual content. On the other hand, it also introduces new challenges for the viewer, who has to decide "where and what" to look at by controlling the viewing direction throughout the full duration of the video. Because the viewer has no information about the content beyond the current FOV, it can be difficult to find interesting content and decide where to look in real time.

To address this difficulty, we define “Pano2Vid”, a new computer vision problem. The task is to design an algorithm that automatically controls the pose and motion of a virtual normal-field-of-view (NFOV) camera within an input 360° video. Camera control must be optimized to produce video that could conceivably have been captured by a human observer equipped with a real NFOV camera. A successful Pano2Vid solution would therefore take the burden of choosing “where to look” off both the videographer and the end viewer: the videographer could enjoy the moment without consciously directing her camera, while the end viewer could watch intelligently-chosen portions of the video in the familiar NFOV format.


Motivation


When watching a 360° video, the viewer must actively control the viewing direction. This is not a trivial task, because the viewer has no information beyond the current field of view. For example, in the video above, the viewer fails to notice the elephant approaching the camera from the opposite direction at the beginning. The key challenge in watching 360° video is therefore finding the right direction to watch.


The Pano2Vid Problem


To overcome the challenge of viewing 360° video, we propose a new computer vision problem that helps people determine where and what to look at in 360° video.


AutoCam


Spatio-temporal Glimpse


A spatio-temporal glimpse is a short normal-field-of-view video extracted from the 360° video with a fixed viewing direction. It transforms 360° content into normal video, making its visual features directly comparable to those of NFOV videos.
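To make the construction concrete, here is a minimal sketch of how one ST-glimpse frame could be extracted from an equirectangular frame via rectilinear projection. It assumes NumPy and OpenCV; the function name, the FOV value, and the output size are illustrative choices, not necessarily the paper's exact settings.

```python
import numpy as np
import cv2  # used only for bilinear remapping; any sampler works

def extract_glimpse(equi_frame, yaw_deg, pitch_deg, fov_deg=65.5, out_size=(256, 256)):
    """Render a normal-FOV view at (yaw, pitch) from an equirectangular frame.

    yaw_deg: azimuthal viewing angle; pitch_deg: angle above/below the equator.
    fov_deg and out_size are illustrative, not the paper's exact values.
    """
    H, W = equi_frame.shape[:2]
    out_h, out_w = out_size
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2.0)  # pinhole focal length

    # Rays through each output pixel in the virtual camera's own frame.
    x = np.arange(out_w) - (out_w - 1) / 2.0
    y = np.arange(out_h) - (out_h - 1) / 2.0
    xv, yv = np.meshgrid(x, y)
    dirs = np.stack([xv, yv, np.full_like(xv, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate the rays to the requested viewing direction (pitch, then yaw).
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    dirs = dirs @ (Ry @ Rx).T

    # Convert rays to spherical coordinates, then to equirectangular pixels.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])        # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))   # [-pi/2, pi/2]
    u = np.mod((lon / (2 * np.pi) + 0.5) * W, W)
    v = (lat / np.pi + 0.5) * (H - 1)
    return cv2.remap(equi_frame, u.astype(np.float32), v.astype(np.float32),
                     cv2.INTER_LINEAR)
```

Applying the same projection to every frame of a 5-second segment yields one ST-glimpse clip.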

Capture-worthiness


We define capture-worthiness as how closely a spatio-temporal glimpse resembles human-captured normal-field-of-view videos ("HumanCam").
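As an illustration, capture-worthiness can be scored with a binary classifier trained to separate HumanCam clips from generic glimpses. The sketch below assumes pre-extracted clip features (e.g., deep video descriptors) and uses scikit-learn's logistic regression; the actual feature extractor and training-set construction may differ from the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_capture_worthiness(humancam_feats, glimpse_feats):
    """Fit a classifier that separates HumanCam clips (positives) from
    sampled ST-glimpses (negatives); both inputs are (N, D) feature arrays."""
    X = np.vstack([humancam_feats, glimpse_feats])
    y = np.concatenate([np.ones(len(humancam_feats)), np.zeros(len(glimpse_feats))])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf

def capture_worthiness(clf, glimpse_feat):
    """Probability that a glimpse 'looks like' a HumanCam clip."""
    return clf.predict_proba(glimpse_feat.reshape(1, -1))[0, 1]
```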

Sample Spatio-temporal Glimpse


Given a 360° video, we densely sample spatio-temporal glimpses in both space and time: 198 glimpses at 18 azimuthal angles and 11 polar angles every 5 seconds. We then estimate the capture-worthiness score of every glimpse.
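A minimal sketch of the sampling grid follows. The evenly spaced polar angles are an assumption made for illustration; only the grid size (18 × 11 = 198 glimpses per 5-second segment) comes from the text.

```python
import numpy as np

def candidate_glimpses(duration_sec, step_sec=5.0, n_azimuth=18, n_polar=11):
    """Enumerate the (time, azimuth, polar) grid of candidate ST-glimpses."""
    times = np.arange(0.0, duration_sec, step_sec)
    azimuths = np.linspace(0.0, 360.0, n_azimuth, endpoint=False)  # degrees
    polars = np.linspace(-75.0, 75.0, n_polar)                     # degrees from the equator (assumed spacing)
    return [(t, az, po) for t in times for az in azimuths for po in polars]

# Usage: extract each candidate with extract_glimpse(...) and score it with
# the capture-worthiness classifier sketched above.
```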

Construct Virtual Camera Trajectory


We transform the problem of controlling the viewing direction into selecting one spatio-temporal glimpse at each moment of the video. We find a path over the spatio-temporal glimpses that maximizes the accumulated capture-worthiness score while obeying a smooth camera motion constraint, which forbids the virtual camera from making abrupt movements. The problem reduces to a shortest-path problem and is solved by dynamic programming.
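The following sketch shows one way to implement the trajectory search as dynamic programming over the trellis of (time step, direction) nodes. The `max_move_deg` smoothness bound is a hypothetical stand-in for the paper's motion constraint.

```python
import numpy as np

def best_trajectory(scores, positions, max_move_deg=30.0):
    """Pick one glimpse per time step maximizing total capture-worthiness.

    scores    : (T, K) capture-worthiness of K candidate directions at each step
    positions : (K, 2) (azimuth, polar) angles in degrees of the K candidates
    max_move_deg : assumed bound on how far the camera may move per step
    """
    T, K = scores.shape
    # Allowed transitions: pairs of directions within the motion bound.
    daz = np.abs((positions[:, None, 0] - positions[None, :, 0] + 180) % 360 - 180)
    dpo = np.abs(positions[:, None, 1] - positions[None, :, 1])
    allowed = (daz <= max_move_deg) & (dpo <= max_move_deg)

    # Forward pass of the dynamic program (equivalently, a shortest-path search).
    dp = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    dp[0] = scores[0]
    for t in range(1, T):
        for k in range(K):
            prev = np.where(allowed[:, k])[0]
            if prev.size:
                j = prev[np.argmax(dp[t - 1, prev])]
                dp[t, k] = dp[t - 1, j] + scores[t, k]
                back[t, k] = j

    # Backtrack the highest-scoring smooth path.
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]  # index of the chosen glimpse at each time step
```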


Experiment


Dataset

We collect the 360° and HumanCam videos from YouTube using the following keywords: "Hiking", "Mountain Climbing", "Parade", and "Soccer".

                # Videos    Total Length
360° Videos     86          7.3 hours
HumanCam        9,171       343 hours

Baselines

  • Center prior – random trajectories biased toward the center of the 360° camera axis (see the sketch after this list)
  • Eye-level prior – static trajectories lying on the equator
  • Saliency – AutoCam with the capture-worthiness score replaced by a saliency score
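For concreteness, here is a sketch of the first two heuristic baselines. The jitter parameter of the center prior is a hypothetical choice; only the bias toward the camera's center direction is specified above.

```python
import numpy as np

def center_prior_trajectory(n_steps, jitter_deg=15.0, rng=None):
    """Random trajectory biased toward the 360° camera's center direction.
    jitter_deg is an assumed spread, not a value from the paper."""
    if rng is None:
        rng = np.random.default_rng()
    az = rng.normal(0.0, jitter_deg, size=n_steps)  # around azimuth 0 (camera center)
    po = rng.normal(0.0, jitter_deg, size=n_steps)  # around the equator
    return np.stack([az, po], axis=1)

def eye_level_trajectory(n_steps, azimuth_deg=0.0):
    """Static trajectory fixed at a single eye-level (equator) direction."""
    return np.tile([azimuth_deg, 0.0], (n_steps, 1))
```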

Evaluation Metrics

  • HumanCam-based metrics – do the algorithm-generated videos look like HumanCam videos?
    • Distinguishability – can algorithm-generated and HumanCam videos be told apart?
    • HumanCam-Likeness – which algorithm generates videos closest to HumanCam videos?
    • Transferability – do semantic classifiers transfer between algorithm-generated and HumanCam videos?
  • HumanEdit-based metrics – does the algorithm control the viewing direction similarly to human viewers of the same 360° video? (A minimal sketch of both metrics follows this list.)
    • Cosine – cosine similarity between the viewing directions chosen by a human viewer and by the algorithm in the same 360° video
    • Overlap – field-of-view overlap between the human viewer's and the algorithm's outputs in the same 360° video
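Below is a sketch of the two HumanEdit-based metrics. The FOV overlap here is approximated by Monte-Carlo sampling inside viewing cones rather than intersecting the actual rectangular FOVs, and the FOV value is illustrative.

```python
import numpy as np

def direction_vector(azimuth_deg, polar_deg):
    """Unit vector for a viewing direction (azimuth, elevation above the equator)."""
    az, po = np.radians(azimuth_deg), np.radians(polar_deg)
    return np.array([np.cos(po) * np.cos(az), np.cos(po) * np.sin(az), np.sin(po)])

def cosine_metric(human_dirs, algo_dirs):
    """Mean per-frame cosine similarity between human and algorithm viewing directions."""
    sims = [np.dot(direction_vector(*h), direction_vector(*a))
            for h, a in zip(human_dirs, algo_dirs)]
    return float(np.mean(sims))

def fov_overlap(human_dir, algo_dir, fov_deg=65.5, n=200):
    """Rough Monte-Carlo estimate of the fraction of the human FOV covered by
    the algorithm FOV, treating each FOV as a cone (an approximation)."""
    half = np.radians(fov_deg) / 2.0
    h, a = direction_vector(*human_dir), direction_vector(*algo_dir)
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(n * 50, 3))
    samples /= np.linalg.norm(samples, axis=1, keepdims=True)
    pts = samples[samples @ h >= np.cos(half)][:n]   # keep points inside the human cone
    if len(pts) == 0:
        return 0.0
    return float(np.mean(pts @ a >= np.cos(half)))   # fraction also inside the algorithm cone
```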

Results

Result figures: Distinguishability, HumanCam-Likeness, Transferability, Cosine Similarity, and FOV Overlap.

Video Examples


AutoCam Outputs

Comparison with Baseline

These two examples show why the center and eye-level heuristics are reasonable baselines to compare against.

  • Center – Videographers often hold the 360° camera in an orientation such that the center corresponds to a special direction, e.g. the direction facing the videographer.
  • Eye-level – Most events appear near the horizon.

Nevertheless, these heuristics cannot adapt to the content and often fail to achieve good framing.

Failure Cases

The first example shows two problems in the AutoCam algorithm:

  • The virtual camera has a limited FOV and cannot capture the entire subject at once
  • The camera motion is restricted by the smooth camera motion constraint, so the camera cannot turn toward the subject promptly

The second example shows that the capture-worthiness score does not encode any preference among different contents that all look like HumanCam videos.

HumanEdit Interface

Design highlights:

  • Display the 360° video in equirectangular projection, so the editor can see all the content at once
  • Expand the panoramic strip by 90° on both sides to avoid discontinuous content at the border
  • The human editor controls the virtual camera direction with the mouse location
  • Back-project the camera's field-of-view onto the 360° video (a sketch of this step follows the list)
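The sketch below illustrates the back-projection step: it maps the virtual camera's FOV border onto equirectangular pixel coordinates so the FOV can be drawn on the panoramic strip. It uses the same coordinate convention as the glimpse-extraction sketch above; the FOV value is illustrative.

```python
import numpy as np

def fov_outline_on_equirect(yaw_deg, pitch_deg, W, H, fov_deg=65.5, n=50):
    """Return (u, v) equirectangular pixel coordinates of the virtual camera's FOV border."""
    half = np.tan(np.radians(fov_deg) / 2.0)
    # Points along the border of the normalized image plane (z = 1).
    t = np.linspace(-half, half, n)
    border = np.concatenate([
        np.stack([t, np.full(n, -half)], axis=1),        # top edge
        np.stack([np.full(n, half), t], axis=1),         # right edge
        np.stack([t[::-1], np.full(n, half)], axis=1),   # bottom edge
        np.stack([np.full(n, -half), t[::-1]], axis=1),  # left edge
    ])
    dirs = np.concatenate([border, np.ones((len(border), 1))], axis=1)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

    # Rotate to the viewing direction, then convert to equirectangular pixels.
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    dirs = dirs @ (Ry @ Rx).T
    lon = np.arctan2(dirs[:, 0], dirs[:, 2])
    lat = np.arcsin(np.clip(dirs[:, 1], -1.0, 1.0))
    u = np.mod((lon / (2 * np.pi) + 0.5) * W, W)
    v = (lat / np.pi + 0.5) * (H - 1)
    return u, v
```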

Publication

