M&M Mix: A Multimodal Multiview Transformer Ensemble

06/20/2022
by Xuehan Xiong, et al.

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M&M) models with varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy on the test set in action classes, higher than last year's winning entry.
