Distilling Knowledge from Language Models for Video-based Action Anticipation

10/12/2022
by Sayontan Ghosh, et al.

Anticipating future actions in a video is useful for many autonomous and assistive technologies. Prior action anticipation work mostly treats this as a vision-modality problem, where models learn the task primarily from video features in the target action anticipation datasets. In this work, we propose a method that exploits the text modality available during training to bring in complementary information not present in the target action anticipation datasets. In particular, we leverage pre-trained language models to build a text-modality teacher that predicts future actions from the text labels of past actions extracted from the input video. To further adapt the teacher to the target domain (cooking), we also pretrain it on textual instructions from a recipes dataset (Recipe1M). We then distill the knowledge gained by the text-modality teacher into a vision-modality student to further improve its performance. We empirically evaluate this simple cross-modal distillation strategy on two video datasets, EGTEA-GAZE+ and EPIC-KITCHENS-55. Distilling this text-modality knowledge into a strong vision model (Anticipative Video Transformer) yields consistent gains across both datasets (+3.5 mean recall on EGTEA-GAZE+ and +7.2 on EPIC-KITCHENS-55) and achieves new state-of-the-art results.
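The distillation step described above transfers the teacher's predictive distribution over future actions to the vision student. The abstract does not spell out the exact loss, but cross-modal distillation is commonly implemented as a temperature-softened KL divergence between teacher and student logits. The sketch below is a minimal, generic version of that idea in pure Python; the function names, the temperature value, and the T² scaling convention are assumptions, not details taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of raw logits.
    # Higher temperature -> softer (more uniform) distribution.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic knowledge-distillation loss (hypothetical sketch).

    Computes KL(teacher || student) on temperature-softened
    distributions, scaled by T^2 as is conventional so gradient
    magnitudes stay comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # text-modality teacher
    q = softmax(student_logits, temperature)  # vision-modality student
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

# Usage: logits over a (toy) future-action vocabulary.
teacher = [2.0, 0.5, -1.0]   # e.g. P(next action = "stir" | past actions as text)
student = [1.0, 1.0, -0.5]   # vision model's current prediction
loss = distillation_loss(student, teacher)
```

In practice this term would be added to the student's usual cross-entropy loss on ground-truth future actions, with a weighting coefficient balancing the two.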


Related research

Feature-Supervised Action Modality Transfer (08/06/2021)
This paper strives for action recognition and detection in video modalit...

Cross-modal knowledge distillation for action recognition (10/10/2019)
In this work, we address the problem how a network for action recognitio...

Multi-Modality Distillation via Learning the teacher's modality-level Gram Matrix (12/21/2021)
In the context of multi-modality knowledge distillation research, the ex...

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer (07/06/2021)
Since visual perception can give rich information beyond text descriptio...

ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition (08/07/2023)
Video Action Recognition (VAR) is a challenging task due to its inherent...

Analysis of Joint Speech-Text Embeddings for Semantic Matching (04/04/2022)
Embeddings play an important role in many recent end-to-end solutions fo...

Multi-modal Alignment using Representation Codebook (02/28/2022)
Aligning signals from different modalities is an important step in visio...
