Detecting Engagement in Egocentric Video

Yu-Chuan Su and Kristen Grauman
The University of Texas at Austin

Concept figure

Imagine you are walking through a grocery store. You may be mindlessly plowing through the aisles grabbing your usual food staples, when a new product display — or an interesting fellow shopper — captures your interest for a few moments. Similarly, in the museum, as you wander the exhibits, occasionally your attention is heightened and you draw near to examine something more closely.

These examples illustrate the notion of engagement in egocentric activity, where one pauses to inspect something more closely. Knowing when engagement is heightened would benefit various applications in video summarization and augmented reality, yet prior work focuses solely on what one is looking at (estimating saliency or gaze) without considering when one is engaged with the environment. We are the first to address the problem of engagement prediction, and we introduce a large, richly annotated dataset for ego-engagement. Our results show that engagement can be detected well independent of both scene appearance and the camera wearer's identity.

Egocentric Engagement


Definition

We define heightened ego-engagement in a browsing scenario as follows. A time interval is considered to have a high engagement level if the recorder is attracted by some object(s), and he interrupts his ongoing flow of activity to purposefully gather more information about the object(s).

Data collection

We ask 9 recorders to take videos under the following three browsing scenarios:

  1. Shopping in a grocery store
  2. Window shopping in a shopping mall
  3. Touring a museum

Overall, we collect 27 videos, each averaging 31 minutes, for a total dataset of 14 hours. Please contact the authors if you would like access to the dataset.

We also collect annotations of ego-engagement on Amazon Mechanical Turk. Each video is annotated by 10 workers, and we define high engagement intervals as those that a majority of the workers mark as positive. See the annotation interface for the instructions given to the workers. Some high engagement intervals are shown below.

Data analysis

We analyze the consistency of the annotations (vs. Consensus) in the following table. The results show that annotators agree reasonably well on the rough interval locations, which supports the soundness of our definition. We also verify how well the third-party labels match the experience of the first-person recorder (vs. Recorder). Overall, the 0.813 F1 score indicates our labels are fairly faithful to individuals' subjective interpretation.

Annotation analysis

Approach


Approach flow chart

(1) Compute frame-wise motion descriptor

Divide each frame into 16x12 uniform cells and compute the optical flow in each cell as the descriptor. The grid motions are smoothed temporally with a Gaussian kernel to integrate out unstable head bobbles.
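
As a rough illustration, here is a minimal sketch of this step in Python, assuming OpenCV dense optical flow and SciPy for the temporal smoothing; the flow algorithm, the smoothing sigma, and the helper names are illustrative assumptions rather than the exact implementation.

    # Sketch of the frame-wise motion descriptor: average optical flow in a
    # 16x12 grid of cells, followed by temporal Gaussian smoothing.
    import cv2
    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    GRID_W, GRID_H = 16, 12  # 16x12 uniform cells per frame

    def frame_motion_descriptor(prev_gray, curr_gray):
        """Return the 16*12*2 = 384-D grid motion descriptor of one frame.
        Inputs are consecutive grayscale (uint8) frames."""
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = flow.shape[:2]
        ys = np.linspace(0, h, GRID_H + 1, dtype=int)
        xs = np.linspace(0, w, GRID_W + 1, dtype=int)
        desc = np.zeros((GRID_H, GRID_W, 2), dtype=np.float32)
        for i in range(GRID_H):
            for j in range(GRID_W):
                cell = flow[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                desc[i, j] = cell.reshape(-1, 2).mean(axis=0)  # mean (dx, dy)
        return desc.ravel()

    def smooth_descriptors(descriptors, sigma=5.0):
        """Temporally smooth a (num_frames, 384) descriptor sequence with a
        Gaussian kernel to integrate out head bobbles (sigma is assumed)."""
        return gaussian_filter1d(np.asarray(descriptors), sigma=sigma, axis=0)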

(2) Estimate frame-wise engagement level

Use the frame-level ground truth to train a classifier that treats frames as i.i.d. samples and estimates frame-wise engagement.
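
For concreteness, the sketch below trains a frame-wise classifier on the smoothed descriptors; the scikit-learn logistic regression is a stand-in assumption, not necessarily the learner used in the paper.

    # Sketch of the frame-wise engagement classifier: frames are treated as
    # i.i.d. samples of (motion descriptor, binary engagement label).
    from sklearn.linear_model import LogisticRegression

    def train_frame_classifier(frame_descs, frame_labels):
        """frame_descs: (num_frames, 384) smoothed motion descriptors.
        frame_labels: binary frame-level engagement ground truth."""
        clf = LogisticRegression(max_iter=1000)
        clf.fit(frame_descs, frame_labels)
        return clf

    def frame_engagement_scores(clf, frame_descs):
        """Per-frame engagement confidence in [0, 1]."""
        return clf.predict_proba(frame_descs)[:, 1]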

(3) Generate interval hypotheses

Use the level set method to generate interval hypotheses from the frame-wise engagement estimates.
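
One simple way to realize the level set idea, sketched below, is to threshold the frame-wise score curve at several levels and treat every maximal run of frames above a level as one interval hypothesis; the particular set of levels is an assumption.

    # Sketch of level-set interval proposal generation from frame-wise scores.
    import numpy as np

    def level_set_proposals(scores, levels=np.linspace(0.1, 0.9, 9)):
        """scores: (num_frames,) frame-wise engagement estimates in [0, 1].
        Returns a sorted list of (start, end) frame indices, end exclusive."""
        proposals = set()
        for level in levels:
            start = None
            for t, above in enumerate(scores >= level):
                if above and start is None:
                    start = t                      # interval opens
                elif not above and start is not None:
                    proposals.add((start, t))      # interval closes
                    start = None
            if start is not None:                  # runs until the last frame
                proposals.add((start, len(scores)))
        return sorted(proposals)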

(4) Compute interval motion by Temporal Pyramid

Aggregate the frame-wise motion descriptors within a hypothesis using a temporal pyramid. The temporal pyramid descriptor captures both the motion distribution and its evolution over time.
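
The sketch below builds a three-level temporal pyramid by mean-pooling the frame descriptors over 1, 2, and 4 temporal segments and concatenating the results; the number of levels and the use of mean pooling are assumptions.

    # Sketch of the temporal pyramid descriptor for one interval hypothesis.
    import numpy as np

    def temporal_pyramid(frame_descs, num_levels=3):
        """frame_descs: (interval_length, 384) motion descriptors of one proposal.
        Returns the concatenation of per-segment means, e.g. 7 * 384-D for 3 levels."""
        frame_descs = np.asarray(frame_descs)
        parts = []
        for level in range(num_levels):
            # Level l splits the interval into 2**l equal temporal segments.
            for segment in np.array_split(frame_descs, 2 ** level, axis=0):
                if len(segment) == 0:  # interval shorter than the segment count
                    parts.append(np.zeros(frame_descs.shape[1]))
                else:
                    parts.append(segment.mean(axis=0))
        return np.concatenate(parts)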

(5) Estimate interval engagement & select candidates

Use the interval-level ground truth and descriptors to learn an engagement classifier. At test time, if a frame is covered by multiple interval proposals, the highest confidence score among them is taken as that frame's final prediction.
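
A minimal sketch of this step follows, with a linear SVM standing in for the interval-level classifier (an assumption); each frame's final score is the maximum confidence over the interval proposals that cover it.

    # Sketch of interval-level training and per-frame candidate selection.
    import numpy as np
    from sklearn.svm import LinearSVC

    def train_interval_classifier(pyramid_descs, interval_labels):
        """pyramid_descs: one temporal pyramid descriptor per training interval."""
        clf = LinearSVC()
        clf.fit(pyramid_descs, interval_labels)
        return clf

    def per_frame_prediction(clf, proposals, pyramid_descs, num_frames):
        """proposals: list of (start, end); pyramid_descs: their descriptors.
        Frames covered by no proposal keep a score of -inf."""
        interval_scores = clf.decision_function(pyramid_descs)
        frame_scores = np.full(num_frames, -np.inf)
        for (start, end), score in zip(proposals, interval_scores):
            frame_scores[start:end] = np.maximum(frame_scores[start:end], score)
        return frame_scores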

Results


Quantitative

Results on UT-EE

Qualitative

We show the predicted intervals of the following methods:

  1. Ground truth
  2. Ours - interval
  3. CNN Appearance
  4. Motion Magnitude (Rallapalli 2014)
  5. Video Saliency (Rudoy 2013)

The learned methods (Ours - interval and CNN Appearance) are trained using the cross-recorder strategy, so the training videos never come from the test recorder. The video clips are from the UT Egocentric Engagement dataset we collected.
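
For reference, a cross-recorder split can be sketched as below so that the test recorder's videos never appear in training; the use of scikit-learn's LeaveOneGroupOut here is an illustrative assumption.

    # Sketch of the leave-one-recorder-out (cross-recorder) evaluation split.
    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut

    def cross_recorder_splits(recorder_ids):
        """recorder_ids: one recorder id per video (e.g. 27 entries, 9 recorders).
        Yields (train_indices, test_indices) pairs, one per held-out recorder."""
        placeholder = np.zeros((len(recorder_ids), 1))  # only the grouping matters
        for train_idx, test_idx in LeaveOneGroupOut().split(
                placeholder, groups=recorder_ids):
            yield train_idx, test_idx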

The recorder walks to the refrigerator and grabs the cabbage. The recorder continuously moves around during the interval, so the motion magnitude baseline fails.

The recorder searches for items on the shelf and looks up and down. The appearance baseline fails when he looks at objects from uncommon views.

The recorder searches for items on the shelf. The appearance baseline fails when the recorder looks up.

The recorder looks at the corner of the painting from an oblique angle. The appearance baseline fails because the recorder looks at the object from an uncommon view.

The recorder looks up at the sign. The appearance baseline fails when the recorder looks at objects that rarely appear in the dataset.

The recorder walks around and looks at the statue from different views. The appearance baseline fails because the statue appears only once in the dataset.

The recorder looks at a series of exhibits. He performs multiple actions and looks at multiple items in the interval. The baseline methods fail to generate stable, continuous predictions.

The recorder walks around the shelf and inspects the goods. The baseline methods fail to generate stable, continuous predictions.

The recorder walks in the aisle of the shopping mall and looks around. He performs actions similar to those he performs when engaged with objects, which triggers false positives from our motion-based method.

The object that attracts the recorder lies along his walking direction. The recorder does not perform any action in response, so our motion-based method fails.

Publication

