Leaving Some Stones Unturned:
Dynamic Feature Prioritization for Activity Detection in Streaming Video
Yu-Chuan Su and Kristen Grauman
The University of Texas at Austin
Current approaches for activity recognition often ignore constraints on computational resources: 1) they rely on extensive feature computation to obtain rich descriptors on all frames, and 2) they assume batch-mode access to the entire test video at once. We propose a new active approach to activity recognition that prioritizes "what to compute when" in order to make timely predictions. The main idea is to learn a policy that dynamically schedules the sequence of features to compute on selected frames of a given test video. In contrast to traditional static feature selection, our approach continually re-prioritizes computation based on the accumulated history of observations and accounts for the transience of those observations in ongoing video. We develop variants to handle both the batch and streaming settings. On two challenging datasets, our method provides significantly better accuracy than alternative techniques for a wide range of computational budgets.
We formulate the problem as a Markov decision process (MDP) and learn the policy using reinforcement learning. Applying reinforcement learning requires defining the MDP components, i.e., the states, the actions, and the reward function.
We apply standard Q-learning with linear function approximation for the action-value function Q. Please see the paper for details.
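For intuition, here is a minimal sketch of Q-learning with linear function approximation, the learning setup named above. The state encoding phi(s), the discrete action space (which feature to compute on which frame next), and the hyperparameter values are illustrative stand-ins, not the paper's exact definitions.

```python
# Q-learning with a linear action-value function: Q(s, a) = w_a . phi(s).
# All names and defaults here are illustrative assumptions; see the paper
# for the actual state, action, and reward definitions.
import numpy as np

class LinearQ:
    """One weight vector per discrete action (e.g., per feature/frame choice)."""
    def __init__(self, state_dim, n_actions, lr=0.01, gamma=0.9, epsilon=0.1):
        self.w = np.zeros((n_actions, state_dim))
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon

    def act(self, phi):
        # Epsilon-greedy choice of the next feature computation.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.w.shape[0])
        return int(np.argmax(self.w @ phi))

    def update(self, phi, action, reward, phi_next, done):
        # Standard one-step Q-learning target with a linear approximator.
        target = reward if done else reward + self.gamma * np.max(self.w @ phi_next)
        td_error = target - self.w[action] @ phi
        self.w[action] += self.lr * td_error * phi
```

At test time the learned weights induce the scheduling policy: at each step, the action with the highest predicted Q-value, i.e., the feature computation expected to be most informative given the observation history so far, is executed next.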
We show quantitative results under the streaming and untrimmed detection settings with different video representations, along with the policies learned by our algorithm. Please refer to the paper for experiment details and more results.
Our method performs better at most object detector speeds; see the two figures on the left.
The advantage is most significant at low detector speeds, or equivalently, under a low resource budget.
Our method intelligently skips uninformative frames as the feature extraction speed increases; see the first figure.
It reaches its final accuracy while processing fewer than 40% of the frames; see figures 2–4.
Our method achieves better accuracy (top left) while reducing the computation cost (bottom) at all object detector speeds.
Our method also performs well on "early" detection, as measured by the AMOC curve (top right); please see the paper for an explanation.
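For reference, an AMOC (Activity Monitoring Operating Characteristic) curve plots the time to detection against the false positive rate, so curves closer to the bottom left indicate earlier and more reliable detection. Below is a hedged sketch of how such a curve can be computed from per-frame detection scores; the score format and the convention of counting a missed event as a normalized time of 1.0 are our assumptions, not necessarily the paper's exact evaluation protocol.

```python
# Sketch: AMOC curve (false positive rate vs. normalized time to detection)
# from per-frame scores. Input format and miss handling are assumptions.
import numpy as np

def amoc_curve(pos_scores, neg_scores, thresholds):
    """pos_scores: list of 1-D score arrays, one per positive (event) video.
    neg_scores: list of 1-D score arrays, one per negative video.
    Returns (fpr, nttd) arrays with one point per threshold."""
    fpr, nttd = [], []
    for th in thresholds:
        # False positive rate: fraction of negative videos that ever fire.
        fp = np.mean([np.any(s >= th) for s in neg_scores])
        # Normalized time to detection: first firing frame / video length;
        # an event that never fires is counted as 1.0 (detected at the end).
        times = []
        for s in pos_scores:
            hits = np.flatnonzero(s >= th)
            times.append((hits[0] + 1) / len(s) if len(hits) else 1.0)
        fpr.append(fp)
        nttd.append(np.mean(times))
    return np.array(fpr), np.array(nttd)
```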
We visualize recognition episodes under the streaming setting. The videos below show how the policy operates at test time.
• ADL — Bag-of-Object
• UCF-101 — Bag-of-Object
• UCF-101 — CNN