Leaving Some Stones Unturned:
Dynamic Feature Prioritization for Activity Detection in Streaming Video
Yu-Chuan Su and Kristen Grauman
The University of Texas at Austin
Current approaches for activity recognition often ignore constraints on computational resources: 1) they rely on extensive feature computation to obtain rich descriptors on all frames, and 2) they assume batch-mode access to the entire test video at once. We propose a new active approach to activity recognition that prioritizes "what to compute when" in order to make timely predictions. The main idea is to learn a policy that dynamically schedules the sequence of features to compute on selected frames of a given test video. In contrast to traditional static feature selection, our approach continually re-prioritizes computation based on the accumulated history of observations and accounts for the transience of those observations in ongoing video. We develop variants to handle both the batch and streaming settings. On two challenging datasets, our method provides significantly better accuracy than alternative techniques for a wide range of computational budgets.
We formulate the problem as a Markov decision process (MDP) and learn the policy using reinforcement learning. Applying reinforcement learning requires defining the MDP components, i.e., the states, the actions, and the reward function.
We apply standard Q-learning with linear function approximation for the action-value function Q. Please see the paper for details.
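For intuition, here is a minimal sketch of Q-learning with linear function approximation, the learning setup named above. The state encoding phi(s), the discrete action space (which feature to compute on which frame next), and the hyperparameter values are illustrative stand-ins, not the paper's exact definitions.

```python
# Q-learning with a linear action-value function: Q(s, a) = w_a . phi(s).
# All names and defaults here are illustrative assumptions; see the paper
# for the actual state, action, and reward definitions.
import numpy as np

class LinearQ:
    """One weight vector per discrete action (e.g., per feature/frame choice)."""
    def __init__(self, state_dim, n_actions, lr=0.01, gamma=0.9, epsilon=0.1):
        self.w = np.zeros((n_actions, state_dim))
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon

    def act(self, phi):
        # Epsilon-greedy choice of the next feature computation.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.w.shape[0])
        return int(np.argmax(self.w @ phi))

    def update(self, phi, action, reward, phi_next, done):
        # Standard one-step Q-learning target with a linear approximator.
        target = reward if done else reward + self.gamma * np.max(self.w @ phi_next)
        td_error = target - self.w[action] @ phi
        self.w[action] += self.lr * td_error * phi
```

At test time the learned weights induce the scheduling policy: at each step, the action with the highest predicted Q-value, i.e., the feature computation expected to be most informative given the observation history so far, is executed next.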
We show quantitative results under the streaming and untrimmed detection settings with different video representations, along with the policies learned by our algorithm. Please refer to the paper for experiment details and more results.
Our method performs better at most object detector speeds; see the two figures on the left.
The advantage is most significant at low detector speeds, or equivalently, under a low resource budget.
Our method intelligently skips uninformative frames as the feature extraction speed increases; see the first figure.
It reaches its final accuracy while processing fewer than 40% of the frames; see figures 2–4.
Our method achieves better accuracy (top left) while reducing the computation cost (bottom) at all object detector speeds.
Our method also performs well on "early" detection, as measured by the AMOC curve (top right); please see the paper for an explanation.
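For reference, an AMOC (Activity Monitoring Operating Characteristic) curve plots the time to detection against the false positive rate, so curves closer to the bottom left indicate earlier and more reliable detection. Below is a hedged sketch of how such a curve can be computed from per-frame detection scores; the score format and the convention of counting a missed event as a normalized time of 1.0 are our assumptions, not necessarily the paper's exact evaluation protocol.

```python
# Sketch: AMOC curve (false positive rate vs. normalized time to detection)
# from per-frame scores. Input format and miss handling are assumptions.
import numpy as np

def amoc_curve(pos_scores, neg_scores, thresholds):
    """pos_scores: list of 1-D score arrays, one per positive (event) video.
    neg_scores: list of 1-D score arrays, one per negative video.
    Returns (fpr, nttd) arrays with one point per threshold."""
    fpr, nttd = [], []
    for th in thresholds:
        # False positive rate: fraction of negative videos that ever fire.
        fp = np.mean([np.any(s >= th) for s in neg_scores])
        # Normalized time to detection: first firing frame / video length;
        # an event that never fires is counted as 1.0 (detected at the end).
        times = []
        for s in pos_scores:
            hits = np.flatnonzero(s >= th)
            times.append((hits[0] + 1) / len(s) if len(hits) else 1.0)
        fpr.append(fp)
        nttd.append(np.mean(times))
    return np.array(fpr), np.array(nttd)
```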
We visualize recognition episodes under the streaming setting. The videos below show how the policy operates at test time.
• ADL — Bag-of-Object
• UCF-101 — Bag-of-Object
• UCF-101 — CNN