Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment

06/08/2023
by Zihui Xue, et al.
The egocentric and exocentric viewpoints of a human activity look dramatically different, yet view-invariant representations that link them are essential for many potential applications in robotics and augmented reality. Prior work is limited to learning view-invariant features from paired, synchronized viewpoints. We relax that strong data assumption and propose to learn fine-grained action features that are invariant to viewpoint by aligning egocentric and exocentric videos in time, even when they were not captured simultaneously or in the same environment. To this end, we propose AE2, a self-supervised embedding approach with two key designs: (1) an object-centric encoder that explicitly focuses on regions corresponding to hands and active objects; (2) a contrastive alignment objective that leverages temporally reversed frames as negative samples. For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context, comprising four datasets (including an ego tennis forehand dataset we collected), along with dense per-frame labels we annotated for each dataset. On the four datasets, our AE2 method strongly outperforms prior work in a variety of fine-grained downstream tasks, in both regular and cross-view settings.
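
To make the idea of using temporally reversed frames as negatives concrete, here is a minimal toy sketch. It is not the AE2 implementation: it assumes classic (non-differentiable) dynamic time warping as the alignment cost and a simple hinge margin as the contrastive term, and the function names `dtw_cost` and `reversal_contrastive_loss`, the `margin` parameter, and the one-dimensional "embeddings" are all hypothetical choices made for illustration. An actual training setup would need a differentiable relaxation of the alignment cost.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two frame embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dtw_cost(seq_a, seq_b):
    """Classic dynamic time warping cost between two embedding sequences."""
    n, m = len(seq_a), len(seq_b)
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sq_dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def reversal_contrastive_loss(ego, exo, margin=1.0):
    """Hinge loss (an assumption for this sketch): the ego sequence should
    align more cheaply with the exo sequence than with its time reversal."""
    positive = dtw_cost(ego, exo)
    negative = dtw_cost(ego, exo[::-1])  # temporally reversed frames as negative
    return max(0.0, margin + positive - negative)


# Toy 1-D "embeddings": both sequences progress 0 -> 3, at different speeds.
ego = [[0.0], [1.0], [2.0], [3.0]]
exo = [[0.0], [0.0], [1.0], [2.0], [3.0]]

print(dtw_cost(ego, exo))                        # cost of the forward alignment
print(dtw_cost(ego, exo[::-1]))                  # cost against the reversed video
print(reversal_contrastive_loss(ego, exo))       # loss when the order agrees
print(reversal_contrastive_loss(ego, exo[::-1])) # loss when the order is flipped
```

In this toy case the two sequences follow the same action progression at different speeds, so the forward alignment is cheap while the reversed one is expensive, and the loss is only non-zero when the temporal order is flipped.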
